Abstract: This report addresses state inference for hidden Markov models. These models rely
on unobserved states, which often have a meaningful interpretation. This makes it necessary to
develop diagnostic tools for quantifying state uncertainty. The entropy of the state sequence
that explains an observed sequence for a given hidden Markov chain model can be considered as
the canonical measure of state sequence uncertainty. This canonical measure is not reflected by
the classic multivariate state profiles computed by the smoothing algorithm, which summarizes
the possible state sequences. Here, we introduce a new type of profile with the following
properties: (i) these profiles of conditional entropies are a decomposition of the canonical
measure of state sequence uncertainty along the sequence and make it possible to localize this
uncertainty, (ii) these profiles are univariate and thus remain easily interpretable on
tree structures. We show how to extend the smoothing algorithms for hidden Markov chain and
tree models to compute these entropy profiles efficiently.
Localizing the canonical uncertainty of latent structures:
entropy profiles for hidden Markov models

Summary: This report concerns state inference in hidden Markov models. These
models rely on unobserved states, which generally have an interpretation in the
context of a given application. This makes it necessary to design diagnostic tools for
quantifying the uncertainty on these states. The entropy of the state sequence associated with an
observed sequence, for a given hidden Markov chain model, can be considered as the
canonical measure of state uncertainty. This canonical measure of uncertainty on the
state sequence is not reflected by the multivariate state profiles computed by the smoothing
algorithm, which summarizes the possible state sequences. We introduce here new profiles with
the following properties: (i) these conditional entropy profiles are a decomposition,
along the sequence, of the canonical measure of uncertainty on the state sequence, which
makes it possible to localize this uncertainty, (ii) these profiles are univariate; they can
therefore easily be used on tree structures. We show how to extend
the smoothing algorithm on hidden Markov chains and trees in order to compute these profiles
efficiently.

Keywords: Conditional entropy, hidden Markov chain models, hidden Markov tree
models, plant architecture analysis
1 Introduction

Hidden Markov chain models have been widely used in signal processing and pattern recognition, 
for the analysis of sequences with various types of underlying structures - for example succession 
of homogeneous zones, or noisy patterns (Ephraim & Merhav, 2002; Zucchini & MacDonald,
2009). This family of models was extended to other kinds of structured data, and particularly 
to tree graphs (Crouse et al., 1998). Concerning statistical inference for hidden Markov models, 
we distinguish inference for the unobserved state process from inference for model parameters 
(Cappe et al., 2005). Our focus here is state inference and more precisely the uncertainty in state 
sequences. 

State inference is particularly relevant in numerous applications where the unobserved states 
have a meaningful interpretation. In such cases, the state sequence has to be restored. The 
restored states may be used, typically, in prediction, in segmentation or in denoising. For example,
Ciriza et al. (2011) proposed to optimize the consumption of printers by prediction of the future 
printing rate from the sequence of printing requests. This rate is related to the parameters of 
a hidden Markov chain model, and an optimal timeout (time before entering sleep mode) is 
derived from the restored states. Le Cadre & Tremois (1998) used a vector of restored states in a 
dynamical system for source tracking in sonar and radar systems. Such use of the state sequence 
makes assessment of the state uncertainty particularly important. 

Not only is state restoration essential for model interpretation, it is generally also used for
model diagnostics and validation, for example based on the visualization of functions of the states.
The use of restored states in the above-mentioned contexts raises the issue of quantifying the 
state sequence uncertainty for a given observed sequence, once a hidden Markov model has been 
estimated. Global quantification of this uncertainty is not sufficient for a precise diagnosis: it 
is also very important to locate this uncertainty along the sequence, for instance to differentiate 
zones that are non-ambiguously explained from zones that are ambiguously explained by the 
estimated model. We have introduced the statistical problem of quantifying state uncertainty in 
the case of hidden Markov models with discrete state space for sequences, but the same reasoning 
applies to other families of latent structure models, including hidden semi-Markov models and 
hidden Markov tree models. 

Methods for exploring the state sequences that explain a given observed sequence for a known 
hidden Markov chain model may be divided into three categories: (i) enumeration of state 
sequences, (ii) state profiles, which are state sequences summarized in a J x T array where J 
is the number of states and T the length of the sequence, (iii) computation of a global measure 
of state sequence uncertainty. The entropy of the state sequence that explains an observed 
sequence for a known hidden Markov chain model was proposed as a global measure of the 
state sequence uncertainty by Hernando et al. (2005). We assume here that this conditional 
entropy is the canonical measure of state sequence uncertainty. Various methods belonging to 
these three categories have been developed for different families of hidden Markovian models, 
including hidden Markov chain and hidden semi-Markov chain models; see Guedon (2007) and 
references therein. We identified some shortcomings of the proposed methods: 

• The entropy of the state sequence is not a direct summary of the state profiles based on the
smoothed probabilities, due to the marginalization that is intrinsic in the computation of
smoothed probabilities. We show that the uncertainty reflected in the classic multivariate
state profiles computed by the smoothing algorithm can be summarized as a univariate
profile of marginal entropies. Each successive marginal entropy quantifies the uncertainty
in the corresponding posterior state distribution for a given time t. The entropy of the
state sequence, in contrast, can be decomposed along the sequence as a profile of conditional
entropies where the conditioning refers to the preceding states. Using results from
information theory, we show that the profile of conditional entropies is pointwise
upper-bounded by the profile of marginal entropies. Hence, the classic state profiles tend to
over-represent the state sequence uncertainty and should be interpreted with caution.

• Due to their multidimensional nature, state profiles are difficult to visualize and interpret 
on trees except in the case of two-state models. 

Our objective is to propose efficient algorithms for computing univariate profiles of conditional 
entropies. These profiles correspond to an additive decomposition of the entropy of the state 
process along the sequence. As a consequence, each term of the decomposition can be interpreted 
as a local contribution to entropy. This principle can be extended to more general supporting 
structures: directed acyclic graphs (DAGs), and in particular, trees. Each contribution is shown 
to be the conditional entropy of the state at each location, given the past or the future of the state 
process. This decomposition allows canonical uncertainty to be localized within the structure, 
which makes the connection between global and local uncertainty easily apprehensible, even for 
hidden Markov tree models. In this case, we propose to compute in a first stage a univariate
profile of conditional entropies that summarizes state uncertainty for each vertex. In a second
stage, the usual state profiles computed by the upward-downward algorithm (Durand et al., 2004),
or an adaptation to trees of the forward-backward Viterbi algorithm of Brushe et al. (1998), are
visualized on selected paths of interest within the tree. This allows for identification of alternative
states at positions with ambiguous state value.

The remainder of this paper is structured as follows. Section 2 focuses on algorithms to
compute entropy profiles for state sequences in hidden Markov chain models. These algorithms
are based on conditioning on either the past or the future of the process. In Section 3, an additive
decomposition of the global state entropy is derived for graphical hidden Markov models indexed
by DAGs. Then algorithms to compute entropy profiles conditioned on the parent states and
conditioned on the children states are derived in detail in the case of hidden Markov tree models.
The use of entropy profiles is illustrated in Section 4 through applications to sequence and tree
data. Section 5 consists of concluding remarks.

2 Entropy profiles for hidden Markov chain models 

In this section, definitions and notations related to hidden Markov chain (HMC) models are 
introduced. These are followed by reminders on the classic forward-backward algorithm and 
the algorithm of Hernando et al. (2005) to compute the entropy of the state sequence. These 
algorithms form the basis of the proposed methodology to compute the state sequence entropy, 
as the sum of local conditional entropies. 

2.1 Definition of a hidden Markov chain model 

A J-state HMC model can be viewed as a pair of discrete-time stochastic processes
$(S, X) = (S_t, X_t)_{t=0,1,\ldots}$ where $S$ is an unobserved Markov chain with finite state space
$\{0, \ldots, J-1\}$ and parameters:

• $\pi_j = P(S_0 = j)$ with $\sum_j \pi_j = 1$ (initial probabilities), and

• $p_{ij} = P(S_t = j \mid S_{t-1} = i)$ with $\sum_j p_{ij} = 1$ (transition probabilities).

The output process $X$ is related to the state process $S$ by the emission (or observation)
probabilities

$$b_j(x) = P(X_t = x \mid S_t = j) \quad \text{with} \quad \sum_x b_j(x) = 1.$$

Since the emission distributions $(b_j)_{j=0,\ldots,J-1}$ are such that a given output $x$ may be observed
in different states, the state process $S$ cannot be deduced without uncertainty from the outputs,
but is observable only indirectly through the output process $X$. To simplify the presentation of
the algorithms, we consider a discrete univariate output process. Since this work focuses on
conditional distributions of states given the outputs, this assumption is not restrictive.

In the sequel, $X_0^t = x_0^t$ is a shorthand for $X_0 = x_0, \ldots, X_t = x_t$ (this convention transposes to
the state sequence $S_0^t = s_0^t$). For a sequence of length $T$, $X_0^{T-1} = x_0^{T-1}$ is simply noted $X = x$.
In the derivation of the algorithms for computing entropy profiles, we will repeatedly use the fact
that if $(S_t)_{t=0,1,\ldots}$ is a first-order Markov chain, the time-reversed process is also a first-order
Markov chain.



2.2 Reminders: forward-backward algorithm and algorithm of Hernando et al. (2005)

The forward-backward algorithm aims at computing the smoothed probabilities
$L_t(j) = P(S_t = j \mid X = x)$ and can be stated as follows (Devijver, 1985). The forward recursion
is initialized at $t = 0$ and for $j = 0, \ldots, J-1$ as follows:

$$F_0(j) = P(S_0 = j \mid X_0 = x_0) = \frac{b_j(x_0)\,\pi_j}{N_0}. \tag{1}$$

The recursion is achieved, for $t = 1, \ldots, T-1$ and for $j = 0, \ldots, J-1$, using:

$$F_t(j) = P(S_t = j \mid X_0^t = x_0^t) = \frac{b_j(x_t)}{N_t} \sum_i p_{ij}\, F_{t-1}(i). \tag{2}$$

The normalizing factor $N_t$ is obtained directly during the forward recursion as follows:

$$N_t = P(X_t = x_t \mid X_0^{t-1} = x_0^{t-1}) = \sum_j P(S_t = j, X_t = x_t \mid X_0^{t-1} = x_0^{t-1}),$$

with

$$P(S_0 = j, X_0 = x_0) = b_j(x_0)\,\pi_j$$

and

$$P(S_t = j, X_t = x_t \mid X_0^{t-1} = x_0^{t-1}) = b_j(x_t) \sum_i p_{ij}\, F_{t-1}(i).$$

The backward recursion is initialized at $t = T-1$ and for $j = 0, \ldots, J-1$ as follows:

$$L_{T-1}(j) = P(S_{T-1} = j \mid X = x) = F_{T-1}(j). \tag{3}$$

The recursion is achieved, for $t = T-2, \ldots, 0$ and for $j = 0, \ldots, J-1$, using:

$$L_t(j) = P(S_t = j \mid X = x) = F_t(j) \sum_k \frac{p_{jk}\, L_{t+1}(k)}{G_{t+1}(k)}, \tag{4}$$



RR n° 7896 






Durand & Guedon 



where

$$G_{t+1}(k) = P(S_{t+1} = k \mid X_0^t = x_0^t) = \sum_j p_{jk}\, F_t(j).$$
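The forward and backward recursions (1) to (4) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the two-state parameters `pi`, `p`, `b` and the observed sequence `x` are hypothetical, the function names are ours, and all predicted probabilities $G_t(k)$ are assumed strictly positive so that the divisions are well defined. The brute-force enumeration is only a check that is feasible for short sequences.

```python
from itertools import product

def forward(pi, p, b, x):
    """Forward recursion (1)-(2): filtered probabilities F[t][j] and factors N[t]."""
    J, F, N = len(pi), [], []
    for t, xt in enumerate(x):
        if t == 0:
            joint = [b[j][xt] * pi[j] for j in range(J)]
        else:
            joint = [b[j][xt] * sum(p[i][j] * F[t - 1][i] for i in range(J))
                     for j in range(J)]
        N.append(sum(joint))                 # N_t = P(X_t = x_t | x_0^{t-1})
        F.append([v / N[t] for v in joint])
    return F, N

def backward(p, F):
    """Backward recursion (3)-(4): smoothed probabilities L[t][j]."""
    T, J = len(F), len(F[0])
    # G[t][k] = P(S_t = k | x_0^{t-1}) for t >= 1 (G[0] is unused)
    G = [None] + [[sum(p[j][k] * F[t][j] for j in range(J)) for k in range(J)]
                  for t in range(T - 1)]
    L = [None] * T
    L[T - 1] = list(F[T - 1])
    for t in range(T - 2, -1, -1):
        L[t] = [F[t][j] * sum(p[j][k] * L[t + 1][k] / G[t + 1][k] for k in range(J))
                for j in range(J)]
    return L, G

def brute_force_smoothing(pi, p, b, x):
    """Reference values obtained by enumerating all J**T state sequences."""
    J, T = len(pi), len(x)
    post, total = [[0.0] * J for _ in range(T)], 0.0
    for s in product(range(J), repeat=T):
        w = pi[s[0]] * b[s[0]][x[0]]
        for t in range(1, T):
            w *= p[s[t - 1]][s[t]] * b[s[t]][x[t]]
        total += w
        for t in range(T):
            post[t][s[t]] += w
    return [[v / total for v in row] for row in post]

# Hypothetical two-state model and observed sequence (illustration only)
pi = [0.6, 0.4]
p = [[0.7, 0.3], [0.2, 0.8]]      # p[i][j] = P(S_t = j | S_{t-1} = i)
b = [[0.9, 0.1], [0.3, 0.7]]      # b[j][x] = P(X_t = x | S_t = j)
x = [0, 1, 1, 0]

F, N = forward(pi, p, b, x)
L, G = backward(p, F)
ref = brute_force_smoothing(pi, p, b, x)
max_gap = max(abs(L[t][j] - ref[t][j]) for t in range(len(x)) for j in range(2))
```

Normalizing at each step, as in (1) and (2), keeps all quantities of order one, which is why no logarithmic rescaling is needed in this sketch.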

These recursions rely on conditional independence properties between hidden and observed
variables in HMC models. Several recursions given in Section 2 rely on the following relations, due
to the time-reversed process of $(S_t, X_t)_{t=0,1,\ldots}$ being also a hidden first-order Markov chain: for
$t = 1, \ldots, T-1$ and for $i, j = 0, \ldots, J-1$,

$$\begin{aligned}
P(S_{t-1} = i \mid S_t = j, X = x) &= P(S_{t-1} = i \mid S_t = j, X_0^t = x_0^t) \\
&= P(S_{t-1} = i \mid S_t = j, X_0^{t-1} = x_0^{t-1}), \\
P(S_0^{t-1} = s_0^{t-1} \mid S_t = j, X = x) &= P(S_0^{t-1} = s_0^{t-1} \mid S_t = j, X_0^t = x_0^t) \\
&= P(S_0^{t-1} = s_0^{t-1} \mid S_t = j, X_0^{t-1} = x_0^{t-1}).
\end{aligned}$$

An algorithm was proposed by Hernando et al. (2005) for computing the entropy of the state
sequence that explains an observed sequence in the case of an HMC model. This algorithm
includes the classic forward recursion given by (1) and (2) as a building block. It requires a
forward recursion on entropies of partial state sequences $S_0^t$. (In the sequel, it is understood that
the entropy of hidden state variables refers to their conditional entropies given observed values.)
This algorithm is initialized at $t = 1$ and for $j = 0, \ldots, J-1$ as follows:

$$H(S_0 \mid S_1 = j, X_0^1 = x_0^1) = -\sum_i P(S_0 = i \mid S_1 = j, X_0^1 = x_0^1) \log P(S_0 = i \mid S_1 = j, X_0^1 = x_0^1). \tag{5}$$

The recursion is achieved, for $t = 2, \ldots, T-1$ and for $j = 0, \ldots, J-1$, using:

$$\begin{aligned}
& H(S_0^{t-1} \mid S_t = j, X_0^t = x_0^t) \\
&= -\sum_{s_0, \ldots, s_{t-1}} P(S_0^{t-1} = s_0^{t-1} \mid S_t = j, X_0^t = x_0^t) \log P(S_0^{t-1} = s_0^{t-1} \mid S_t = j, X_0^t = x_0^t) \\
&= -\sum_{s_0, \ldots, s_{t-2}} \sum_i P(S_0^{t-2} = s_0^{t-2} \mid S_{t-1} = i, S_t = j, X_0^t = x_0^t)\, P(S_{t-1} = i \mid S_t = j, X_0^t = x_0^t) \\
&\qquad \times \left\{ \log P(S_0^{t-2} = s_0^{t-2} \mid S_{t-1} = i, S_t = j, X_0^t = x_0^t) + \log P(S_{t-1} = i \mid S_t = j, X_0^t = x_0^t) \right\} \\
&= -\sum_i P(S_{t-1} = i \mid S_t = j, X_0^{t-1} = x_0^{t-1}) \Big\{ \sum_{s_0, \ldots, s_{t-2}} P(S_0^{t-2} = s_0^{t-2} \mid S_{t-1} = i, X_0^{t-1} = x_0^{t-1}) \\
&\qquad \times \log P(S_0^{t-2} = s_0^{t-2} \mid S_{t-1} = i, X_0^{t-1} = x_0^{t-1}) + \log P(S_{t-1} = i \mid S_t = j, X_0^{t-1} = x_0^{t-1}) \Big\} \\
&= \sum_i P(S_{t-1} = i \mid S_t = j, X_0^{t-1} = x_0^{t-1}) \left\{ H(S_0^{t-2} \mid S_{t-1} = i, X_0^{t-1} = x_0^{t-1}) - \log P(S_{t-1} = i \mid S_t = j, X_0^{t-1} = x_0^{t-1}) \right\},
\end{aligned} \tag{6}$$

with

$$P(S_{t-1} = i \mid S_t = j, X_0^{t-1} = x_0^{t-1}) = \frac{P(S_t = j, S_{t-1} = i \mid X_0^{t-1} = x_0^{t-1})}{P(S_t = j \mid X_0^{t-1} = x_0^{t-1})} = \frac{p_{ij}\, F_{t-1}(i)}{G_t(j)}.$$

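The forward recursion (6) is a direct consequence of the conditional independence properties
within an HMC model and can be interpreted as the chain rule

$$H(S_0^{t-1} \mid S_t = j, X_0^t = x_0^t) = H(S_0^{t-2} \mid S_{t-1}, S_t = j, X_0^t = x_0^t) + H(S_{t-1} \mid S_t = j, X_0^t = x_0^t) \tag{7}$$

with

$$\begin{aligned}
& H(S_0^{t-2} \mid S_{t-1}, S_t = j, X_0^t = x_0^t) \\
&= -\sum_{s_0, \ldots, s_{t-1}} P(S_0^{t-1} = s_0^{t-1} \mid S_t = j, X_0^t = x_0^t) \log P(S_0^{t-2} = s_0^{t-2} \mid S_{t-1} = s_{t-1}, S_t = j, X_0^t = x_0^t) \\
&= -\sum_i P(S_{t-1} = i \mid S_t = j, X_0^t = x_0^t) \sum_{s_0, \ldots, s_{t-2}} P(S_0^{t-2} = s_0^{t-2} \mid S_{t-1} = i, X_0^{t-1} = x_0^{t-1}) \\
&\qquad \times \log P(S_0^{t-2} = s_0^{t-2} \mid S_{t-1} = i, X_0^{t-1} = x_0^{t-1}) \\
&= \sum_i P(S_{t-1} = i \mid S_t = j, X_0^{t-1} = x_0^{t-1})\, H(S_0^{t-2} \mid S_{t-1} = i, X_0^{t-1} = x_0^{t-1})
\end{aligned}$$

and

$$H(S_{t-1} \mid S_t = j, X_0^t = x_0^t) = -\sum_i P(S_{t-1} = i \mid S_t = j, X_0^{t-1} = x_0^{t-1}) \log P(S_{t-1} = i \mid S_t = j, X_0^{t-1} = x_0^{t-1}).$$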

Using a similar argument as in (6), the termination step is given by

$$\begin{aligned}
H(S_0^{T-1} \mid X = x)
&= -\sum_j P(S_{T-1} = j \mid X = x) \Big\{ \sum_{s_0, \ldots, s_{T-2}} P(S_0^{T-2} = s_0^{T-2} \mid S_{T-1} = j, X = x) \\
&\qquad \times \log P(S_0^{T-2} = s_0^{T-2} \mid S_{T-1} = j, X = x) + \log P(S_{T-1} = j \mid X = x) \Big\} \\
&= \sum_j F_{T-1}(j) \left\{ H(S_0^{T-2} \mid S_{T-1} = j, X = x) - \log F_{T-1}(j) \right\}.
\end{aligned} \tag{8}$$

The forward recursion, the backward recursion and the algorithm of Hernando et al. (2005) all
have complexity in $O(J^2 T)$.
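The entropy recursion (5), (6) and (8) can be sketched as follows. This is an illustrative sketch rather than the reference implementation of Hernando et al. (2005): the model parameters and function names are hypothetical, and the recursion is started from zeros at $t = 0$ (an empty past has zero entropy), which reproduces the initialization (5) at $t = 1$. All predicted probabilities are assumed strictly positive. The result is compared with the entropy obtained by enumerating every state sequence.

```python
from itertools import product
from math import log

def forward(pi, p, b, x):
    """Filtered probabilities F[t][j] = P(S_t = j | x_0^t), recursions (1)-(2)."""
    J, F = len(pi), []
    for t, xt in enumerate(x):
        if t == 0:
            joint = [b[j][xt] * pi[j] for j in range(J)]
        else:
            joint = [b[j][xt] * sum(p[i][j] * F[t - 1][i] for i in range(J))
                     for j in range(J)]
        n = sum(joint)
        F.append([v / n for v in joint])
    return F

def state_sequence_entropy(pi, p, b, x):
    """Entropy H(S | X = x) computed in O(J^2 T) with (5), (6) and (8)."""
    J, T = len(pi), len(x)
    F = forward(pi, p, b, x)
    # H[j] = H(S_0^{t-1} | S_t = j, x_0^t); zero at t = 0 since the past is empty
    H = [0.0] * J
    for t in range(1, T):
        G = [sum(p[i][j] * F[t - 1][i] for i in range(J)) for j in range(J)]
        H_new = []
        for j in range(J):
            h = 0.0
            for i in range(J):
                q = p[i][j] * F[t - 1][i] / G[j]   # P(S_{t-1}=i | S_t=j, x_0^{t-1})
                if q > 0.0:
                    h += q * (H[i] - log(q))        # recursion (6)
            H_new.append(h)
        H = H_new
    last = F[T - 1]
    # Termination step (8)
    return sum(last[j] * (H[j] - log(last[j])) for j in range(J) if last[j] > 0.0)

def brute_force_entropy(pi, p, b, x):
    """Entropy of P(S = s | X = x) by enumeration of all J**T sequences."""
    J, T = len(pi), len(x)
    w = []
    for s in product(range(J), repeat=T):
        v = pi[s[0]] * b[s[0]][x[0]]
        for t in range(1, T):
            v *= p[s[t - 1]][s[t]] * b[s[t]][x[t]]
        w.append(v)
    total = sum(w)
    return -sum(v / total * log(v / total) for v in w if v > 0.0)

# Hypothetical two-state model and observed sequence (illustration only)
pi = [0.6, 0.4]
p = [[0.7, 0.3], [0.2, 0.8]]
b = [[0.9, 0.1], [0.3, 0.7]]
x = [0, 1, 1, 0, 0]

h_fast = state_sequence_entropy(pi, p, b, x)
h_ref = brute_force_entropy(pi, p, b, x)
```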
2.3 Entropy profiles for hidden Markov chain models

In what follows, we derive algorithms to compute entropy profiles based on conditional and partial
entropies. Firstly, conditioning with respect to past states is considered. Then, conditioning with
respect to future states is considered.

The proposed algorithms have a twofold aim, since they focus on computing both

• profiles of partial state sequence entropies $(H(S_0^t \mid X = x))_{t=0,\ldots,T-1}$, and

• profiles of conditional entropies $(H(S_t \mid S_{t-1}, X = x))_{t=0,\ldots,T-1}$, where the term for $t = 0$ is understood as $H(S_0 \mid X = x)$.

We propose a first solution where the partial state sequence entropies are computed beforehand,
and the conditional entropies are deduced from the latter. Then, we propose an alternative
solution where the conditional entropies are computed directly, and the partial state sequence
entropies are extracted from these.

The profiles of conditional entropies have the noteworthy property that the global state
sequence entropy can be decomposed as a sum of entropies conditioned on the past states:

$$H(S \mid X = x) = H(S_0 \mid X = x) + \sum_{t=1}^{T-1} H(S_t \mid S_{t-1}, X = x). \tag{9}$$

This property comes from the fact that the state sequence $S$ is conditionally a Markov chain
given $X = x$.

In this way, the state sequence uncertainty can be localized along the observed sequence,
$H(S_t \mid S_{t-1}, X = x)$ representing the local contribution at time $t$ to the state sequence entropy.
For $t = 0, \ldots, T-1$, using conditional independence properties within HMC models, we have

$$\begin{aligned}
H(S_0^t \mid X = x)
&= -\sum_{s_0, \ldots, s_t} P(S_0^t = s_0^t \mid X = x) \log P(S_0^t = s_0^t \mid X = x) \\
&= -\sum_j P(S_t = j \mid X = x) \Big\{ \sum_{s_0, \ldots, s_{t-1}} P(S_0^{t-1} = s_0^{t-1} \mid S_t = j, X_0^t = x_0^t) \\
&\qquad \times \log P(S_0^{t-1} = s_0^{t-1} \mid S_t = j, X_0^t = x_0^t) + \log P(S_t = j \mid X = x) \Big\} \\
&= \sum_j L_t(j) \left\{ H(S_0^{t-1} \mid S_t = j, X_0^t = x_0^t) - \log L_t(j) \right\} \\
&= \sum_j L_t(j)\, H(S_0^{t-1} \mid S_t = j, X_0^t = x_0^t) + H(S_t \mid X = x).
\end{aligned} \tag{10}$$

3 

Using a similar argument as in (7), equation (10) can be interpreted as the chain rule

$$H(S_0^t \mid X = x) = H(S_0^{t-1} \mid S_t, X = x) + H(S_t \mid X = x).$$

In this way, the profile of partial state sequence entropies $(H(S_0^t \mid X = x))_{t=0,\ldots,T-1}$ can be
computed as a byproduct of the forward-backward algorithm where the usual forward recursion
(2) and the recursion (6) proposed by Hernando et al. (2005) are mixed. The conditional
entropies are then directly deduced by first-order differencing

$$\begin{aligned}
H(S_t \mid S_{t-1}, X = x) &= H(S_t \mid S_0^{t-1}, X = x) \\
&= H(S_0^t \mid X = x) - H(S_0^{t-1} \mid X = x).
\end{aligned} \tag{11}$$
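The first solution, which combines (6), (10) and (11), can be sketched as follows. This is a hedged sketch under assumptions: the model parameters are hypothetical, the names `entropy_profiles` and `brute_force_partial` are ours, and all predicted probabilities are assumed strictly positive. The brute-force function computes the entropy of each prefix $S_0^t$ given $X = x$ by enumeration and is only meant as a check.

```python
from itertools import product
from math import log

def entropy_profiles(pi, p, b, x):
    """Partial entropies H(S_0^t | X = x) via (10) and their differences (11)."""
    J, T = len(pi), len(x)
    F, G, C = [], [None], [[0.0] * J]
    for t in range(T):
        if t == 0:
            joint = [b[j][x[0]] * pi[j] for j in range(J)]
        else:
            # G[t][j] = P(S_t = j | x_0^{t-1})
            G.append([sum(p[i][j] * F[t - 1][i] for i in range(J)) for j in range(J)])
            joint = [b[j][x[t]] * G[t][j] for j in range(J)]
            # Recursion (6) on C[t][j] = H(S_0^{t-1} | S_t = j, x_0^t)
            row = []
            for j in range(J):
                q = [p[i][j] * F[t - 1][i] / G[t][j] for i in range(J)]
                row.append(sum(q[i] * (C[t - 1][i] - log(q[i]))
                               for i in range(J) if q[i] > 0.0))
            C.append(row)
        n = sum(joint)
        F.append([v / n for v in joint])
    # Backward recursion (3)-(4) for the smoothed probabilities L[t][j]
    L = [None] * T
    L[T - 1] = list(F[T - 1])
    for t in range(T - 2, -1, -1):
        L[t] = [F[t][j] * sum(p[j][k] * L[t + 1][k] / G[t + 1][k] for k in range(J))
                for j in range(J)]
    # Equation (10), then first-order differencing (11)
    partial = [sum(L[t][j] * (C[t][j] - log(L[t][j]))
                   for j in range(J) if L[t][j] > 0.0) for t in range(T)]
    conditional = [partial[0]] + [partial[t] - partial[t - 1] for t in range(1, T)]
    return partial, conditional

def brute_force_partial(pi, p, b, x):
    """Entropy of each prefix S_0^t given X = x, by enumeration."""
    J, T = len(pi), len(x)
    weights = {}
    for s in product(range(J), repeat=T):
        w = pi[s[0]] * b[s[0]][x[0]]
        for t in range(1, T):
            w *= p[s[t - 1]][s[t]] * b[s[t]][x[t]]
        weights[s] = w
    total = sum(weights.values())
    out = []
    for t in range(T):
        prefix = {}
        for s, w in weights.items():
            prefix[s[:t + 1]] = prefix.get(s[:t + 1], 0.0) + w / total
        out.append(-sum(q * log(q) for q in prefix.values() if q > 0.0))
    return out

# Hypothetical two-state model and observed sequence (illustration only)
pi = [0.6, 0.4]
p = [[0.7, 0.3], [0.2, 0.8]]
b = [[0.9, 0.1], [0.3, 0.7]]
x = [0, 1, 1, 0]

partial, conditional = entropy_profiles(pi, p, b, x)
reference = brute_force_partial(pi, p, b, x)
```

The last entry of `partial` is the global state sequence entropy, and the entries of `conditional` are the local contributions that sum to it.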



Inria 



Entropy profiles for hidden Markov models 






As an alternative, the profile of conditional entropies $(H(S_t \mid S_{t-1}, X = x))_{t=0,\ldots,T-1}$ could
also be computed directly, as

$$H(S_t \mid S_{t-1}, X = x) = -\sum_{i,j} P(S_t = j, S_{t-1} = i \mid X = x) \log P(S_t = j \mid S_{t-1} = i, X = x) \tag{12}$$

with

$$\begin{aligned}
P(S_t = j \mid S_{t-1} = i, X = x) &= \frac{L_t(j)\, p_{ij}\, F_{t-1}(i)}{G_t(j)\, L_{t-1}(i)} \quad \text{and} \\
P(S_t = j, S_{t-1} = i \mid X = x) &= \frac{L_t(j)\, p_{ij}\, F_{t-1}(i)}{G_t(j)}.
\end{aligned}$$

These latter quantities are directly extracted during the backward recursion (4) of the
forward-backward algorithm.

In summary, a first possibility is to compute the profile of partial state sequence entropies
$(H(S_0^t \mid X = x))_{t=0,\ldots,T-1}$ using the usual forward and backward recursions combined with (5),
(6) and (10), from which the profile of conditional entropies $(H(S_t \mid S_{t-1}, X = x))_{t=0,\ldots,T-1}$ is
directly deduced by first-order differencing (11). A second possibility is to compute the profile
of conditional entropies directly using the usual forward and backward recursions combined with
(12) and to deduce the profile of partial state sequence entropies by summation. The time
complexity of both algorithms is in $O(J^2 T)$.

The conditional entropy is bounded from above by the marginal entropy (Cover & Thomas, 2006,
chap. 2):

$$H(S_t \mid S_{t-1}, X = x) \le H(S_t \mid X = x),$$

with

$$H(S_t \mid X = x) = -\sum_j P(S_t = j \mid X = x) \log P(S_t = j \mid X = x) = -\sum_j L_t(j) \log L_t(j),$$

and the difference between the marginal and the conditional entropy is the mutual information
between $S_t$ and $S_{t-1}$, given $X = x$. Thus, the marginal entropy profile $(H(S_t \mid X = x))_{t=0,\ldots,T-1}$
can be viewed as pointwise upper bounds on the conditional entropy profile
$(H(S_t \mid S_{t-1}, X = x))_{t=1,\ldots,T-1}$.
The profile of marginal entropies can be interpreted as a summary of the classic state profiles
given by the smoothed probabilities $(P(S_t = j \mid X = x))_{t=0,\ldots,T-1;\, j=0,\ldots,J-1}$. Hence, the
difference between the marginal entropy $H(S_t \mid X = x)$ and the conditional entropy $H(S_t \mid S_{t-1}, X = x)$
can be seen as a defect of the classic state profiles, which provide a representation of the state
sequences such that global uncertainty is overestimated.
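The direct computation (12) and the comparison with the marginal entropy profile can be illustrated with the following sketch. It rests on assumptions: the two-state model is hypothetical, the function names are ours, and the predicted probabilities are assumed strictly positive. The sketch checks numerically that the conditional entropies sum to the global entropy, as in (9), and that each of them is bounded by the corresponding marginal entropy.

```python
from itertools import product
from math import log

def smooth(pi, p, b, x):
    """Forward-backward pass: F (filtered), G (predicted), L (smoothed)."""
    J, T = len(pi), len(x)
    F, G = [], [None]
    for t in range(T):
        if t == 0:
            pred = list(pi)
        else:
            pred = [sum(p[i][j] * F[t - 1][i] for i in range(J)) for j in range(J)]
            G.append(pred)
        joint = [b[j][x[t]] * pred[j] for j in range(J)]
        n = sum(joint)
        F.append([v / n for v in joint])
    L = [None] * T
    L[T - 1] = list(F[T - 1])
    for t in range(T - 2, -1, -1):
        L[t] = [F[t][j] * sum(p[j][k] * L[t + 1][k] / G[t + 1][k] for k in range(J))
                for j in range(J)]
    return F, G, L

def conditional_and_marginal(pi, p, b, x):
    """Conditional profile via (12) and marginal profile from L."""
    J, T = len(pi), len(x)
    F, G, L = smooth(pi, p, b, x)
    marginal = [-sum(q * log(q) for q in L[t] if q > 0.0) for t in range(T)]
    conditional = [marginal[0]]          # the term for t = 0 is H(S_0 | X = x)
    for t in range(1, T):
        h = 0.0
        for i in range(J):
            for j in range(J):
                # pair = P(S_t = j, S_{t-1} = i | X = x)
                pair = L[t][j] * p[i][j] * F[t - 1][i] / G[t][j]
                if pair > 0.0:
                    h -= pair * log(pair / L[t - 1][i])
        conditional.append(h)
    return conditional, marginal

def brute_force_entropy(pi, p, b, x):
    """Entropy of P(S = s | X = x) by enumeration of all J**T sequences."""
    J, T = len(pi), len(x)
    w = []
    for s in product(range(J), repeat=T):
        v = pi[s[0]] * b[s[0]][x[0]]
        for t in range(1, T):
            v *= p[s[t - 1]][s[t]] * b[s[t]][x[t]]
        w.append(v)
    total = sum(w)
    return -sum(v / total * log(v / total) for v in w if v > 0.0)

# Hypothetical two-state model and observed sequence (illustration only)
pi = [0.6, 0.4]
p = [[0.7, 0.3], [0.2, 0.8]]
b = [[0.9, 0.1], [0.3, 0.7]]
x = [0, 1, 1, 0, 1]

conditional, marginal = conditional_and_marginal(pi, p, b, x)
h_ref = brute_force_entropy(pi, p, b, x)
# Pointwise gap = mutual information between S_t and S_{t-1} given X = x
gaps = [m - c for c, m in zip(conditional, marginal)]
```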

Entropy profiles conditioned on the future for hidden Markov chain models. The
Markov property, which states that the past and the future are independent given the present,
essentially treats the past and the future symmetrically. However, there is a lack of symmetry
in the parameterization of a Markov chain, with the consequence that often only the state process
conditioned on the past is investigated. Yet the state uncertainty at time t may be
better explained by the values of future states than by those of past states. Consequently, in the
present context of state inference, we chose to investigate the state process both forward and
backward in time.

Entropy profiles conditioned on the future states rely on the following decomposition of the
entropy of the state sequence, as a sum of local entropies where state $S_t$ at time $t$ is conditioned
on the future states:

$$H(S \mid X = x) = \sum_{t=0}^{T-2} H(S_t \mid S_{t+1}, X = x) + H(S_{T-1} \mid X = x).$$

This is a consequence of the reverse state process being a Markov chain, given $X = x$.

An algorithm to compute the backward conditional entropies
$H(S_{t+1}^{T-1} \mid S_t = j, X_{t+1}^{T-1} = x_{t+1}^{T-1})$
can be proposed. This algorithm, detailed in Appendix A.1, is similar to that of Hernando et al.
(2005) but relies on a backward recursion. Using similar arguments as in (10), we have

$$H(S_t^{T-1} \mid X = x) = \sum_j L_t(j) \left\{ H(S_{t+1}^{T-1} \mid S_t = j, X_{t+1}^{T-1} = x_{t+1}^{T-1}) - \log L_t(j) \right\}. \tag{14}$$

Thus, the profile of partial state sequence entropies $(H(S_t^{T-1} \mid X = x))_{t=0,\ldots,T-1}$ can be computed
as a byproduct of the forward-backward algorithm, where the usual backward recursion (4) and
the backward recursion for conditional entropies (see Appendix A.1) are mixed. The conditional
entropies are then directly deduced by first-order differencing

$$\begin{aligned}
H(S_t \mid S_{t+1}, X = x) &= H(S_t \mid S_{t+1}^{T-1}, X = x) \\
&= H(S_t^{T-1} \mid X = x) - H(S_{t+1}^{T-1} \mid X = x).
\end{aligned}$$

The profile of conditional entropies $(H(S_t \mid S_{t+1}, X = x))_{t=0,\ldots,T-1}$ can also be computed directly,
as

$$H(S_t \mid S_{t+1}, X = x) = -\sum_{j,k} P(S_t = j, S_{t+1} = k \mid X = x) \log P(S_t = j \mid S_{t+1} = k, X = x)$$

with

$$\begin{aligned}
P(S_t = j \mid S_{t+1} = k, X = x) &= P(S_t = j \mid S_{t+1} = k, X_0^t = x_0^t) = \frac{p_{jk}\, F_t(j)}{G_{t+1}(k)} \quad \text{and} \\
P(S_t = j, S_{t+1} = k \mid X = x) &= \frac{L_{t+1}(k)\, p_{jk}\, F_t(j)}{G_{t+1}(k)}.
\end{aligned}$$

The latter quantities are directly extracted during the forward (2) and backward (4) recursions
of the forward-backward algorithm. The conditional entropy is bounded from above by the
marginal entropy (Cover & Thomas, 2006, chap. 2):

$$H(S_t \mid S_{t+1}, X = x) \le H(S_t \mid X = x).$$
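The direct computation of the profile conditioned on the future can be sketched in the same way. This is an illustrative sketch under assumptions (hypothetical model parameters, function names of our choosing, strictly positive predicted probabilities). It checks that the future-conditioned terms, plus the marginal entropy at $T-1$, sum to the same global entropy as the enumeration of all state sequences.

```python
from itertools import product
from math import log

def smooth(pi, p, b, x):
    """Forward-backward pass: F (filtered), G (predicted), L (smoothed)."""
    J, T = len(pi), len(x)
    F, G = [], [None]
    for t in range(T):
        if t == 0:
            pred = list(pi)
        else:
            pred = [sum(p[i][j] * F[t - 1][i] for i in range(J)) for j in range(J)]
            G.append(pred)
        joint = [b[j][x[t]] * pred[j] for j in range(J)]
        n = sum(joint)
        F.append([v / n for v in joint])
    L = [None] * T
    L[T - 1] = list(F[T - 1])
    for t in range(T - 2, -1, -1):
        L[t] = [F[t][j] * sum(p[j][k] * L[t + 1][k] / G[t + 1][k] for k in range(J))
                for j in range(J)]
    return F, G, L

def future_conditional_profile(pi, p, b, x):
    """Profile H(S_t | S_{t+1}, X = x), closed by H(S_{T-1} | X = x)."""
    J, T = len(pi), len(x)
    F, G, L = smooth(pi, p, b, x)
    profile = []
    for t in range(T - 1):
        h = 0.0
        for j in range(J):
            for k in range(J):
                cond = p[j][k] * F[t][j] / G[t + 1][k]  # P(S_t=j | S_{t+1}=k, X=x)
                pair = L[t + 1][k] * cond               # P(S_t=j, S_{t+1}=k | X=x)
                if pair > 0.0:
                    h -= pair * log(cond)
        profile.append(h)
    profile.append(-sum(q * log(q) for q in L[T - 1] if q > 0.0))
    marginal = [-sum(q * log(q) for q in L[t] if q > 0.0) for t in range(T)]
    return profile, marginal

def brute_force_entropy(pi, p, b, x):
    """Entropy of P(S = s | X = x) by enumeration of all J**T sequences."""
    J, T = len(pi), len(x)
    w = []
    for s in product(range(J), repeat=T):
        v = pi[s[0]] * b[s[0]][x[0]]
        for t in range(1, T):
            v *= p[s[t - 1]][s[t]] * b[s[t]][x[t]]
        w.append(v)
    total = sum(w)
    return -sum(v / total * log(v / total) for v in w if v > 0.0)

# Hypothetical two-state model and observed sequence (illustration only)
pi = [0.6, 0.4]
p = [[0.7, 0.3], [0.2, 0.8]]
b = [[0.9, 0.1], [0.3, 0.7]]
x = [1, 0, 0, 1]

profile, marginal = future_conditional_profile(pi, p, b, x)
h_ref = brute_force_entropy(pi, p, b, x)
```

Although the past-conditioned and future-conditioned profiles sum to the same total, their individual terms generally differ, which is what makes the two views complementary.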

3 Entropy profiles for hidden Markov tree models 

In this section, hidden Markov tree (HMT) models are introduced, as a particular case of
graphical hidden Markov (GHM) models. A generic additive decomposition of state entropy in GHM
models is proposed, and its implementation is discussed in the case of HMT models.

3.1 Graphical hidden Markov models 

Let $\mathcal{G}$ be a directed acyclic graph (DAG) with vertex set $\mathcal{U}$, and $S = (S_u)_{u \in \mathcal{U}}$ be a J-state
process indexed by $\mathcal{U}$. Let $\mathcal{G}(S)$ be the graph with vertices $S$, isomorphic to $\mathcal{G}$ (so that the
set of vertices of $\mathcal{G}(S)$ may be assimilated with $\mathcal{U}$). It is assumed that $S$ satisfies the graphical
Markov property with respect to $\mathcal{G}(S)$, in the sense defined by Lauritzen (1996). The states $S_u$
are observed indirectly through an output process $X = (X_u)_{u \in \mathcal{U}}$ such that given $S$, the $(X_u)_{u \in \mathcal{U}}$
are independent, and for any $u$, $X_u$ is independent of $(S_v)_{v \in \mathcal{U};\, v \neq u}$ given $S_u$. Then process $X$ is
referred to as a GHM model with respect to DAG $\mathcal{G}$.

Let $\mathrm{pa}(u)$ denote the set of parents of $u \in \mathcal{U}$. For any subset $E$ of $\mathcal{U}$, let $S_E$ denote $(S_u)_{u \in E}$.
As a consequence of the Markov property of $S$, the following factorization of $P_S$ holds for any
$s$ (Lauritzen, 1996):

$$P(S = s) = \prod_u P(S_u = s_u \mid S_{\mathrm{pa}(u)} = s_{\mathrm{pa}(u)}),$$

where $P(S_u = s_u \mid S_{\mathrm{pa}(u)} = s_{\mathrm{pa}(u)})$ must be understood as $P(S_u = s_u)$ if $\mathrm{pa}(u) = \emptyset$. This
factorization property is shown by induction on the vertices in $\mathcal{U}$, starting from the sink vertices
(vertices without children), and ending at the source vertices (vertices without parents).

In the particular case where $\mathcal{G}$ is a rooted tree graph, $X$ is called a hidden Markov out-tree
with conditionally-independent children states, given their parent state (or more shortly, a
hidden Markov tree model). This model was introduced by Crouse et al. (1998) in the context
of signal and image processing using wavelet trees. The state process $S$ is called a Markov tree.

The following notations will be used for a tree graph $\mathcal{T}$: for any vertex $u$, $c(u)$ denotes the
set of children of $u$ and $p(u)$ denotes its parent. Let $\mathcal{T}_u$ denote the complete subtree rooted at
vertex $u$, $\bar{X}_u = \bar{x}_u$ denote the observed complete subtree rooted at $u$ (the bar distinguishes a
subtree from the single variable $X_u$ at its root), $\bar{X}_{c(u)} = \bar{x}_{c(u)}$ denote the
collection of observed subtrees rooted at children of vertex $u$ (that is, subtree $\bar{x}_u$ except its root
$x_u$), $\bar{X}_{u \setminus v} = \bar{x}_{u \setminus v}$ the subtree $\bar{x}_u$ except the subtree $\bar{x}_v$ (assuming that $\bar{x}_v$ is a proper subtree of
$\bar{x}_u$), and finally $\bar{X}_{b(u)} = \bar{x}_{b(u)}$ the family of brother subtrees $(\bar{X}_v)_{v \in c(p(u));\, v \neq u}$ of $u$ (assuming that
$u$ is not the root vertex). This notation transposes to the state process with for instance $\bar{S}_u = \bar{s}_u$,
the state subtree rooted at vertex $u$. In the sequel, we will use the notation $\mathcal{U} = \{0, \ldots, n-1\}$
to denote the vertex set of a tree with size $n$, and the root vertex will be $u = 0$. Thus, the entire
observed tree can be denoted by $\bar{X}_0 = \bar{x}_0$, although the shorter notation $X = x$ will be used
hereafter. These notations are illustrated in Figure 1.

A J-state HMT model $(S, X) = (S_u, X_u)_{u \in \mathcal{U}}$ is defined by the following parameters:

• initial probabilities (for the root vertex) $\pi_j = P(S_0 = j)$ with $\sum_j \pi_j = 1$,

• transition probabilities $p_{jk} = P(S_u = k \mid S_{p(u)} = j)$ with $\sum_k p_{jk} = 1$,

and by the emission distributions defined as in HMC models by $P(X_u = x \mid S_u = j) = b_j(x)$.
In GHM models, the state process is conditionally Markovian in the following sense:

Proposition 1 Let $(S, X)$ be a GHM model with respect to DAG $\mathcal{G}$. Then for any $x$, the
conditional distribution of $S$ given $X = x$ satisfies the Markov property on $\mathcal{G}$ and for any $s$,

$$P(S = s \mid X = x) = \prod_u P(S_u = s_u \mid S_{\mathrm{pa}(u)} = s_{\mathrm{pa}(u)}, X = x),$$

where $P(S_u = s_u \mid S_{\mathrm{pa}(u)} = s_{\mathrm{pa}(u)}, X = x)$ denotes $P(S_u = s_u \mid X = x)$ if $\mathrm{pa}(u) = \emptyset$.

Proof. To prove this proposition, we consider a potential realization $(s, x)$ of process $(S, X)$. We
introduce the following definitions and notations: for $u \in \mathcal{U}$, $\mathrm{An}(u)$ denotes the set of ancestors
of $u$ in $\mathcal{G}$; for $A \subset \mathcal{U}$, $\mathrm{An}(A) = \bigcup_{u \in A} \mathrm{An}(u)$ and $\overline{\mathrm{An}}(A) = \mathrm{An}(A) \cup A$. Let $S_A = s_A$ denote the
state process indexed by the graph induced by $A$. By conditional independence of the $(X_u)_{u \in \mathcal{U}}$
given $S$, the process $(S, X)$ follows the Markov property on the DAG $\mathcal{G}(S, X)$ obtained from
$\mathcal{G}(S)$ by addition of the set of vertices $\{X_u \mid u \in \mathcal{U}\}$ and the set of arcs $\{(S_u, X_u) \mid u \in \mathcal{U}\}$.







It is proved by induction on subgraphs $A$ of $\mathcal{G}$ that if $\overline{\mathrm{An}}(A) = A$, then

$$P(S_A = s_A \mid X = x) = \prod_{v \in A} P(S_v = s_v \mid S_{\mathrm{pa}(v)} = s_{\mathrm{pa}(v)}, X = x). \tag{15}$$

Since the joint distribution of state vertices in different connected components $(\mathcal{G}_1, \ldots, \mathcal{G}_C)$ of
$\mathcal{G}$ can be factorized as $\prod_c P(S_{\mathcal{G}_c} = s_{\mathcal{G}_c} \mid X = x)$, equation (15) is proved separately for each
connected component.

It is easily seen that if $u$ is a source of $\mathcal{G}$, both the right-hand and the left-hand sides of
equation (15) are equal to $P(S_u = s_u \mid X = x)$. To prove the induction step, we consider a vertex
$u \notin A$ such that $\mathrm{pa}(u) \subset A$. If no such vertex exists, $A$ is a connected component of $\mathcal{G}$,
which terminates the induction.

Otherwise, let $A'$ denote $A \cup \{u\}$. Then $\overline{\mathrm{An}}(A') = A'$ and

$$\begin{aligned}
P(S_{A'} = s_{A'} \mid X = x) &= P(S_u = s_u \mid S_{\mathrm{pa}(u)} = s_{\mathrm{pa}(u)}, S_{A \setminus \mathrm{pa}(u)} = s_{A \setminus \mathrm{pa}(u)}, X = x) \\
&\qquad \times P(S_{\mathrm{pa}(u)} = s_{\mathrm{pa}(u)}, S_{A \setminus \mathrm{pa}(u)} = s_{A \setminus \mathrm{pa}(u)} \mid X = x) \\
&= P(S_u = s_u \mid S_{\mathrm{pa}(u)} = s_{\mathrm{pa}(u)}, X = x)\, P(S_A = s_A \mid X = x)
\end{aligned}$$

since the Markov property on $\mathcal{G}(S, X)$ implies conditional independence of $S_u$ and $S_{A \setminus \mathrm{pa}(u)}$
given $S_{\mathrm{pa}(u)}$ and $X$.

The proof is completed by application of induction equation (15). ■

From application of the chain rule (Cover & Thomas, 2006, chap. 2) to Proposition 1, the
following corollary is derived:

Corollary 1 Let $(S, X)$ be a GHM model with respect to DAG $\mathcal{G}$. Then for any $x$,

$$H(S \mid X = x) = \sum_u H(S_u \mid S_{\mathrm{pa}(u)}, X = x),$$

where $H(S_u \mid S_{\mathrm{pa}(u)}, X = x)$ denotes $H(S_u \mid X = x)$ if $\mathrm{pa}(u) = \emptyset$.

This result extends equation (9) for HMC models to hidden Markov models indexed by DAGs.

It follows from Corollary 1 that the global entropy of the state process can be decomposed as
a sum of conditional entropies, where each term is the local contribution of state $S_u$ at vertex $u$,
and corresponds to the conditional entropy of this state given the parent state (or equivalently,
given the non-descendant states, from the Markov property on $\mathcal{G}(S, X)$).

The remainder of this Section focuses on the derivation of algorithms to compute $H(S \mid X = x)$
efficiently in HMT models.
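The decomposition of Corollary 1 can be checked numerically on a small tree by exhaustive enumeration. The sketch below is an illustration under assumptions and not an efficient algorithm: the tree (encoded by a hypothetical `parent` list), the parameters and the observations are invented for the example, and the cost is exponential in the number of vertices.

```python
from itertools import product
from math import log

# Hypothetical rooted tree: parent[u] is the parent of vertex u (root 0 has none)
parent = [None, 0, 0, 1, 1]
pi = [0.6, 0.4]
p = [[0.7, 0.3], [0.2, 0.8]]      # p[j][k] = P(S_u = k | S_p(u) = j)
b = [[0.9, 0.1], [0.3, 0.7]]      # b[j][x] = P(X_u = x | S_u = j)
x = [0, 1, 0, 1, 1]

def posterior(parent, pi, p, b, x):
    """P(S = s | X = x) for every state configuration s, by enumeration."""
    J, n = len(pi), len(parent)
    w = {}
    for s in product(range(J), repeat=n):
        v = pi[s[0]] * b[s[0]][x[0]]
        for u in range(1, n):
            v *= p[s[parent[u]]][s[u]] * b[s[u]][x[u]]
        w[s] = v
    total = sum(w.values())
    return {s: v / total for s, v in w.items()}

def entropy(dist):
    """Shannon entropy (natural logarithm) of a dictionary of probabilities."""
    return -sum(q * log(q) for q in dist.values() if q > 0.0)

def marginal(post, idx):
    """Marginal posterior distribution of the states at the vertices in idx."""
    out = {}
    for s, q in post.items():
        key = tuple(s[u] for u in idx)
        out[key] = out.get(key, 0.0) + q
    return out

post = posterior(parent, pi, p, b, x)
global_entropy = entropy(post)

# H(S_u | S_p(u), X = x) = H(S_u, S_p(u) | X = x) - H(S_p(u) | X = x)
local = [entropy(marginal(post, [0]))]
for u in range(1, len(parent)):
    local.append(entropy(marginal(post, [u, parent[u]]))
                 - entropy(marginal(post, [parent[u]])))
```

Each entry of `local` is the contribution of one vertex, so plotting it on the tree gives a univariate profile even when the number of states is larger than two.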

3.2 Reminder: upward-downward algorithm 

The upward-downward algorithm aims at computing the smoothed probabilities
$\xi_u(j) = P(S_u = j \mid X = x)$ and can be stated as follows (Durand et al., 2004). It consists of three
recursions, which all have complexity in $O(J^2 n)$.

This algorithm requires preliminary computation of the state marginal probabilities
$P(S_u = j)$, computed by a downward recursion. This recursion is initialized at the root vertex $u = 0$ and
for $j = 0, \ldots, J-1$ as follows:

$$P(S_0 = j) = \pi_j.$$

The recursion is achieved, for vertices $u \neq 0$ taken downwards and for $j = 0, \ldots, J-1$, using:

$$P(S_u = j) = \sum_i p_{ij}\, P(S_{p(u)} = i).$$

The upward recursion is initialized for each leaf $u$ as follows. For $j = 0, \ldots, J-1$,

$$\beta_u(j) = P(S_u = j \mid X_u = x_u) = \frac{b_j(x_u)\, P(S_u = j)}{N_u}.$$

The recursion is achieved, for internal vertices $u$ taken upwards and for $j = 0, \ldots, J-1$, using:

$$\beta_{p(u),u}(j) = \frac{P(\bar{X}_u = \bar{x}_u \mid S_{p(u)} = j)}{P(\bar{X}_u = \bar{x}_u)} = \sum_k \frac{\beta_u(k)\, p_{jk}}{P(S_u = k)}$$

and

$$\beta_u(j) = P(S_u = j \mid \bar{X}_u = \bar{x}_u) = \frac{\left\{ \prod_{v \in c(u)} \beta_{u,v}(j) \right\} b_j(x_u)\, P(S_u = j)}{N_u},$$

where $\bar{x}_u$ is the observed subtree rooted at $u$. The normalizing factor $N_u$ is obtained directly
during the upward recursion by

$$N_u = P(X_u = x_u) = \sum_j b_j(x_u)\, P(S_u = j)$$

for the leaf vertices, and

$$N_u = \frac{P(\bar{X}_u = \bar{x}_u)}{\prod_{v \in c(u)} P(\bar{X}_v = \bar{x}_v)} = \sum_j \left\{ \prod_{v \in c(u)} \beta_{u,v}(j) \right\} b_j(x_u)\, P(S_u = j)$$

for the internal vertices.

The downward recursion is initialized at the root vertex $u = 0$ and for $j = 0, \ldots, J-1$ as follows:

$$\xi_0(j) = P(S_0 = j \mid X = x) = \beta_0(j).$$

The recursion is achieved, for vertices $u \neq 0$ taken downwards and for $j = 0, \ldots, J-1$, using:

$$\xi_u(j) = P(S_u = j \mid X = x) = \frac{\beta_u(j)}{P(S_u = j)} \sum_i \frac{p_{ij}\, \xi_{p(u)}(i)}{\beta_{p(u),u}(i)}. \tag{16}$$

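These recursions rely on conditional independence properties between hidden and observed
variables in HMT models. In several recursions given in Section 3, the following relations will be
used, where a bar denotes a complete subtree: for any internal, non-root vertex $u$ and for $j = 0, \ldots, J-1$,

$$\begin{aligned}
P(\bar{S}_{c(u)} = \bar{s}_{c(u)} \mid S_u = j, \bar{S}_{0 \setminus u} = \bar{s}_{0 \setminus u}, X = x)
&= P(\bar{S}_{c(u)} = \bar{s}_{c(u)} \mid S_u = j, S_{p(u)} = s_{p(u)}, X = x) \\
&= P(\bar{S}_{c(u)} = \bar{s}_{c(u)} \mid S_u = j, X = x) \\
&= \prod_{v \in c(u)} P(\bar{S}_v = \bar{s}_v \mid S_u = j, X = x)
\end{aligned}$$

and

$$\begin{aligned}
P(\bar{S}_u = \bar{s}_u \mid \bar{S}_{0 \setminus u} = \bar{s}_{0 \setminus u}, X = x)
&= P(\bar{S}_u = \bar{s}_u \mid S_{p(u)} = s_{p(u)}, X = x) \\
&= P(\bar{S}_u = \bar{s}_u \mid S_{p(u)} = s_{p(u)}, \bar{X}_u = \bar{x}_u).
\end{aligned}$$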



3.3 Algorithms for computing entropy profiles for hidden Markov tree 
models 

In HMT models, the generic decomposition of global state tree entropy yielded by Corollary 1
reads

$$H(S \mid X = x) = H(S_0 \mid X = x) + \sum_{u \neq 0} H(S_u \mid S_{p(u)}, X = x).$$

As in the case of HMC models, such a decomposition of $H(S \mid X = x)$ along the tree structure
allows the computation of entropy profiles, which rely on conditional and partial state entropies.

In a first approach, conditional entropies $H(S_u \mid S_{p(u)}, X = x)$ are directly extracted during
the downward recursion (16). Then the conditional entropies $H(\bar{S}_u \mid S_{p(u)}, X = x)$ of the state
subtrees and the partial state tree entropies $H(\bar{S}_u \mid X = x)$ are computed using an upward algorithm
that requires the results of the upward-downward recursion. They are also used in a downward recursion to
compute profiles of partial state tree entropies $H(\bar{S}_{0 \setminus u} \mid X = x)$.

In a second approach, conditional entropies $H(\bar{S}_{c(u)} \mid S_u = j, \bar{X}_u = \bar{x}_u)$ are computed directly
during the upward recursion given in Section 3.2, without requiring the downward probabilities
$\xi_u(j)$. These conditional entropies are used to compute directly profiles of partial state tree
entropies $H(\bar{S}_u \mid X = x)$ and $H(\bar{S}_{0 \setminus u} \mid X = x)$.

We also provide an algorithm to compute conditional entropies given the children states
$H(S_u \mid S_{c(u)}, X = x)$. We show that, unlike $H(S_u \mid S_{p(u)}, X = x)$, these quantities do not
correspond to local contributions to $H(S \mid X = x)$, but their sum over all vertices $u$ is lower
bounded by $H(S \mid X = x)$.



Computation of partial state tree entropy using conditional entropy of state subtree
given parent state. Firstly, for every non-root vertex u, the conditional entropy

$$H(S_u | S_{p(u)}, X = x) = -\sum_{i,j} P(S_u = j, S_{p(u)} = i | X = x) \log P(S_u = j | S_{p(u)} = i, X = x), \qquad (17)$$

is directly extracted during the downward recursion (16), similarly to (15) for HMC models, with

$$P(S_u = j | S_{p(u)} = i, X = x) = \beta_u(j) p_{ij} / \{P(S_u = j) \beta_{p(u),u}(i)\} \quad \text{and}$$
$$P(S_u = j, S_{p(u)} = i | X = x) = \beta_u(j) p_{ij} \xi_{p(u)}(i) / \{P(S_u = j) \beta_{p(u),u}(i)\}. \qquad (18)$$

The partial state tree entropy $H(\bar{S}_u | S_{p(u)}, X = x)$ is computed using an upward algorithm.
Initialization is achieved at the leaf vertices u using equation (17).
The recursion is given, for all non-root vertices u taken upwards, by:

$$H(\bar{S}_u | S_{p(u)}, X = x) = H(\bar{S}_{c(u)} | S_u, S_{p(u)}, X = x) + H(S_u | S_{p(u)}, X = x)$$
$$= \sum_{v \in c(u)} H(\bar{S}_v | S_u, X = x) + H(S_u | S_{p(u)}, X = x). \qquad (19)$$

Equation (19) can be interpreted as the chain rule

$$H(\bar{S}_u | S_{p(u)}, X = x) = H(S_u | S_{p(u)}, X = x) + \sum_{v \text{ descendant of } u} H(S_v | S_{p(v)}, X = x), \qquad (20)$$

deduced from factorization

$$P(\bar{S}_u = \bar{s}_u | S_{p(u)} = s_{p(u)}, X = x) = P(S_u = s_u | S_{p(u)} = s_{p(u)}, X = x) \times \prod_{v \text{ descendant of } u} P(S_v = s_v | S_{p(v)} = s_{p(v)}, X = x),$$

which is similar to Proposition 1. An analogous factorization yields

$$H(\bar{S}_u | X = x) = H(\bar{S}_{c(u)} | S_u, X = x) + H(S_u | X = x) \qquad (21)$$
$$= \sum_{v \in c(u)} H(\bar{S}_v | S_u, X = x) + H(S_u | X = x).$$

Thus, profiles of partial state tree entropies $(H(\bar{S}_u | X = x))_{u \in \mathcal{U}}$ can be deduced from
$(H(\bar{S}_u | S_{p(u)}, X = x))_{u \in \mathcal{U}}$ and the marginal entropies

$$H(S_u | X = x) = -\sum_j \xi_u(j) \log \xi_u(j).$$

The global state tree entropy $H(S | X = x)$ is obtained from (21) at root vertex u = 0.
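
The way (17), (19) and (21) combine can be illustrated on a toy tree. In the sketch below, the posterior over state trees is obtained by exhaustive enumeration, which is an assumption made to keep the example short (in the algorithm described here the pairwise posteriors come from (18)); the function names and the toy parameters are hypothetical.

```python
import itertools
import math


def posterior(parent, pi, P, B, x):
    """P(S = s | X = x) for every state tree s, by enumeration (toy sizes only)."""
    n, J = len(parent), len(pi)
    joint = {}
    for s in itertools.product(range(J), repeat=n):
        p = pi[s[0]] * B[s[0]][x[0]]
        for u in range(1, n):
            p *= P[s[parent[u]]][s[u]] * B[s[u]][x[u]]
        joint[s] = p
    total = sum(joint.values())
    return {s: p / total for s, p in joint.items()}


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_profiles(parent, post, J):
    n = len(parent)
    # Smoothed marginals xi[u][j] and pairwise posteriors
    # pair[u][i][j] = P(S_p(u) = i, S_u = j | X = x).
    xi = [[0.0] * J for _ in range(n)]
    pair = [[[0.0] * J for _ in range(J)] for _ in range(n)]
    for s, p in post.items():
        for u in range(n):
            xi[u][s[u]] += p
            if u > 0:
                pair[u][s[parent[u]]][s[u]] += p
    marginal = [entropy(row) for row in xi]
    # Equation (17); the root entry is the marginal entropy H(S_0 | X = x).
    cond = [marginal[0]] + [0.0] * (n - 1)
    for u in range(1, n):
        for i in range(J):
            for j in range(J):
                q = pair[u][i][j]
                if q > 0:
                    cond[u] -= q * math.log(q / xi[parent[u]][i])
    # Equation (19): subtree entropy given the parent state, accumulated upward
    # (vertices are numbered so that parent[u] < u).
    sub = list(cond)
    for u in reversed(range(1, n)):
        sub[parent[u]] += sub[u]
    # Equation (21): swap the conditional term at u for the marginal entropy.
    partial = [sub[u] - cond[u] + marginal[u] for u in range(n)]
    return cond, marginal, partial


# Hypothetical toy model: 5 vertices, 2 states, binary observations.
parent = [None, 0, 0, 1, 1]
pi = [0.6, 0.4]
P = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]
x = [0, 1, 0, 1, 1]

post = posterior(parent, pi, P, B, x)
cond, marginal, partial = entropy_profiles(parent, post, len(pi))
global_entropy = entropy(post.values())
```

Here `partial[0]` and `sum(cond)` both recover the global state tree entropy, and each `cond[u]` is bounded above by `marginal[u]`.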
Profiles of partial state tree entropies $(H(\bar{S}_{0 \setminus u} | X = x))_{u \in \mathcal{U}}$ can also be computed using the
following downward recursion, initialized at every child u of the root vertex by

$$H(\bar{S}_{0 \setminus u} | X = x) = H(S_0 | X = x) + H(\bar{S}_{b(u)} | S_0, X = x)$$
$$= H(S_0 | X = x) + \sum_{v \in b(u)} H(\bar{S}_v | S_0, X = x). \qquad (22)$$

The downward recursion is given at vertex v with parent u = p(v) by

$$H(\bar{S}_{0 \setminus v} | X = x) = H(S_u, \bar{S}_{b(v)} | \bar{S}_{0 \setminus u}, X = x) + H(\bar{S}_{0 \setminus u} | X = x)$$
$$= H(\bar{S}_{b(v)} | S_u, \bar{S}_{0 \setminus u}, X = x) + H(S_u | \bar{S}_{0 \setminus u}, X = x) + H(\bar{S}_{0 \setminus u} | X = x)$$
$$= \sum_{w \in b(v)} H(\bar{S}_w | S_{p(w)}, X = x) + H(S_u | S_{p(u)}, X = x) + H(\bar{S}_{0 \setminus u} | X = x), \qquad (23)$$

where for any $w \in b(v)$, p(w) = u.

Note that equations (21), (22) and (23) can be written under the same form: if $\mathcal{V}$ is a subtree of
$\mathcal{T}$, then the entropy of state subtree $S_{\mathcal{V}}$ is

$$H(S_{\mathcal{V}} | X = x) = \sum_{v \in \mathcal{V}} H(S_v | S_{p(v)}, X = x),$$

where $H(S_v | S_{p(v)}, X = x)$ refers to $H(S_v | X = x)$ if v is the root vertex or if p(v) does not
belong to $\mathcal{V}$.

Recursion (23) can be terminated at any leaf vertex u using the following equation:

$$H(S | X = x) = H(S_u | \bar{S}_{0 \setminus u}, X = x) + H(\bar{S}_{0 \setminus u} | X = x)$$
$$= H(S_u | S_{p(u)}, X = x) + H(\bar{S}_{0 \setminus u} | X = x).$$

In summary, the profile of conditional entropies $(H(S_u | S_{p(u)}, X = x))_{u \in \mathcal{U}}$ is firstly computed
using (17). The conditional entropies are used in (19) to derive the partial state tree entropies
$H(\bar{S}_u | S_{p(u)}, X = x)$, which are combined with the marginal entropies in (21) to derive profiles
of partial state tree entropies $(H(\bar{S}_u | X = x))_{u \in \mathcal{U}}$. They are also combined with the conditional
entropies in (23) to compute the profiles $(H(\bar{S}_{0 \setminus u} | X = x))_{u \in \mathcal{U}}$. The time complexity of the
algorithm is in $O(J^2 n)$.

As in HMC models, the marginal entropy profile $(H(S_u | X = x))_{u \in \mathcal{U}}$ can be viewed as pointwise
upper bounds on the conditional entropy profile $(H(S_u | S_{p(u)}, X = x))_{u \in \mathcal{U}}$.

Direct computation of conditional entropy of children state subtrees given each state 

As an alternative, the entropies $H(\bar{S}_{c(u)} | S_u = j, \bar{X}_u = \bar{x}_u)$ can be computed directly during the
upward recursion given in Section 3.2. These are similar to the entropies $H(S_0^{t-1} | S_t = j, X_0^t = x_0^t)$,
used in the algorithm of Hernando et al. (2005) in HMC models. Therefore, the following
algorithm can be seen as a generalization of their approach to HMT models. Its specificity,
compared with the approach based on the conditional entropies $H(S_u | S_{p(u)}, X = x)$, is that it
does not require the results of the downward recursion.

This upward algorithm is initialized at the leaf vertices u by

$$H(\bar{S}_{c(u)} | S_u = j, \bar{X}_u = \bar{x}_u) = 0.$$

Since $\bar{S}_{c(u)}$ and $\bar{X}_{0 \setminus u}$ are conditionally independent given $S_u$ and $\bar{X}_u$, we have for any state
j, $H(\bar{S}_{c(u)} | S_u = j, \bar{X}_u = \bar{x}_u) = H(\bar{S}_{c(u)} | S_u = j, X = x)$. Combining this equation with (21)
yields

$$H(\bar{S}_{c(u)} | S_u = j, \bar{X}_u = \bar{x}_u) = H(\bar{S}_{c(u)} | S_u = j, \bar{X}_{c(u)} = \bar{x}_{c(u)})$$
$$= \sum_{v \in c(u)} H(\bar{S}_v | S_u = j, \bar{X}_v = \bar{x}_v),$$

which is similar to the backward recursion (30) in time-reversed HMC models (see Appendix).

Moreover, for any $v \in c(u)$ with $c(v) \neq \emptyset$ and for j = 0, . . . , J - 1,

$$H(\bar{S}_v | S_u = j, \bar{X}_u = \bar{x}_u)$$
$$= -\sum_{\bar{s}_{c(v)}, s_v} P(\bar{S}_{c(v)} = \bar{s}_{c(v)}, S_v = s_v | S_u = j, \bar{X}_u = \bar{x}_u) \log P(\bar{S}_{c(v)} = \bar{s}_{c(v)}, S_v = s_v | S_u = j, \bar{X}_u = \bar{x}_u)$$
$$= -\sum_{\bar{s}_{c(v)}} \sum_k P(\bar{S}_{c(v)} = \bar{s}_{c(v)} | S_v = k, S_u = j, \bar{X}_u = \bar{x}_u) P(S_v = k | S_u = j, \bar{X}_u = \bar{x}_u)$$
$$\quad \times \{\log P(\bar{S}_{c(v)} = \bar{s}_{c(v)} | S_v = k, S_u = j, \bar{X}_u = \bar{x}_u) + \log P(S_v = k | S_u = j, \bar{X}_u = \bar{x}_u)\}$$
$$= -\sum_k P(S_v = k | S_u = j, \bar{X}_v = \bar{x}_v) \Big\{ \sum_{\bar{s}_{c(v)}} P(\bar{S}_{c(v)} = \bar{s}_{c(v)} | S_v = k, \bar{X}_v = \bar{x}_v)$$
$$\quad \times \log P(\bar{S}_{c(v)} = \bar{s}_{c(v)} | S_v = k, \bar{X}_v = \bar{x}_v) + \log P(S_v = k | S_u = j, \bar{X}_v = \bar{x}_v) \Big\}$$
$$= \sum_k P(S_v = k | S_u = j, \bar{X}_v = \bar{x}_v) \{ H(\bar{S}_{c(v)} | S_v = k, \bar{X}_v = \bar{x}_v) - \log P(S_v = k | S_u = j, \bar{X}_v = \bar{x}_v) \}. \qquad (24)$$

Thus, the recursion of the upward algorithm is given by

$$H(\bar{S}_{c(u)} | S_u = j, \bar{X}_u = \bar{x}_u) \qquad (25)$$
$$= \sum_{v \in c(u)} \sum_{s_v} P(S_v = s_v | S_u = j, \bar{X}_v = \bar{x}_v) \left[ H(\bar{S}_{c(v)} | S_v = s_v, \bar{X}_v = \bar{x}_v) - \log P(S_v = s_v | S_u = j, \bar{X}_v = \bar{x}_v) \right],$$

where $P(S_v = k | S_u = j, \bar{X}_v = \bar{x}_v) = P(S_v = k | S_u = j, X = x)$ is given by equation (18).
The termination step is obtained by similar arguments as equation (21):

$$H(S | X = x) = H(\bar{S}_{c(0)} | S_0, X = x) + H(S_0 | X = x)$$
$$= \sum_j \beta_0(j) \{ H(\bar{S}_{c(0)} | S_0 = j, X = x) - \log \beta_0(j) \}.$$

If each vertex has a single child, HMT and HMC models coincide, and equation (25) appears as
a generalization of (14) for the computation of conditional entropies in time-reversed HMCs.
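
The upward recursion (25) and its termination step can be sketched on a toy tree as follows. This is an illustration under simplifying assumptions, not the normalized recursion of Section 3.2: it propagates unnormalized subtree likelihoods, which is adequate only for small trees, and the names `upward_entropies` and `brute_force_entropy` and the toy parameters are hypothetical. The result is compared with the entropy obtained by enumerating all state trees.

```python
import itertools
import math


def upward_entropies(parent, pi, P, B, x):
    """Upward computation of Hc[u][j], the entropy of the children state
    subtrees of u given S_u = j and the observations in the subtree rooted at u.

    L[u][j] is the unnormalized likelihood of the observed subtree rooted at u
    given S_u = j (toy trees only: long trees would underflow).
    """
    n, J = len(parent), len(pi)
    children = [[] for _ in range(n)]
    for u in range(1, n):
        children[parent[u]].append(u)

    L = [None] * n
    Hc = [None] * n
    for u in reversed(range(n)):          # children are visited before parents
        L[u] = [B[j][x[u]] for j in range(J)]
        Hc[u] = [0.0] * J                 # initialization: 0 at the leaves
        for v in children[u]:
            for j in range(J):
                w = [P[j][k] * L[v][k] for k in range(J)]
                tot = sum(w)
                L[u][j] *= tot
                for k in range(J):
                    q = w[k] / tot        # P(S_v = k | S_u = j, observed subtree at v)
                    if q > 0:
                        # Recursion (25): entropy below v plus the term -log q.
                        Hc[u][j] += q * (Hc[v][k] - math.log(q))

    # Termination at the root, with beta0[j] = P(S_0 = j | X = x).
    w0 = [pi[j] * L[0][j] for j in range(J)]
    beta0 = [a / sum(w0) for a in w0]
    H = sum(q * (Hc[0][j] - math.log(q)) for j, q in enumerate(beta0) if q > 0)
    return Hc, H


def brute_force_entropy(parent, pi, P, B, x):
    """H(S | X = x) by enumerating all J**n state trees (toy sizes only)."""
    n, J = len(parent), len(pi)
    weights = []
    for s in itertools.product(range(J), repeat=n):
        p = pi[s[0]] * B[s[0]][x[0]]
        for u in range(1, n):
            p *= P[s[parent[u]]][s[u]] * B[s[u]][x[u]]
        weights.append(p)
    total = sum(weights)
    return -sum(p / total * math.log(p / total) for p in weights if p > 0)


# Hypothetical toy model: 5 vertices, 2 states, binary observations.
parent = [None, 0, 0, 1, 1]
pi = [0.6, 0.4]
P = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]
x = [0, 1, 0, 1, 1]

Hc, H = upward_entropies(parent, pi, P, B, x)
H_ref = brute_force_entropy(parent, pi, P, B, x)
```

No downward pass is needed to obtain `H`, which is the point of this second approach.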
Using similar arguments as above, the partial state tree entropy $H(\bar{S}_u | X = x)$ can be
deduced from the conditional entropies $H(\bar{S}_{c(u)} | S_u = j, \bar{X}_u = \bar{x}_u)$ (with j = 0, . . . , J - 1) as
follows:

$$H(\bar{S}_u | X = x) = H(\bar{S}_{c(u)} | S_u, X = x) + H(S_u | X = x)$$
$$= \sum_j \xi_u(j) \{ H(\bar{S}_{c(u)} | S_u = j, X = x) - \log \xi_u(j) \}$$
$$= \sum_j \xi_u(j) \{ H(\bar{S}_{c(u)} | S_u = j, \bar{X}_u = \bar{x}_u) - \log \xi_u(j) \}, \qquad (26)$$

where the $(\xi_u(j))_{j = 0, \ldots, J-1}$ are directly extracted from the downward recursion (16). Moreover,
since

$$H(\bar{S}_{0 \setminus u} | X = x) = H(S | X = x) - H(\bar{S}_u | \bar{S}_{0 \setminus u}, X = x)$$
$$= H(S | X = x) - H(\bar{S}_u | S_{p(u)}, X = x)$$

and since

$$H(\bar{S}_u | S_{p(u)}, X = x) = H(\bar{S}_{c(u)} | S_u, X = x) + H(S_u | S_{p(u)}, X = x),$$

the partial state tree entropy $H(\bar{S}_{0 \setminus u} | X = x)$ can also be deduced from the conditional entropies
$(H(\bar{S}_{c(u)} | S_u = j, \bar{X}_u = \bar{x}_u))_{j = 0, \ldots, J-1}$ using

$$H(\bar{S}_{0 \setminus u} | X = x) \qquad (27)$$
$$= H(S | X = x) - \sum_j \xi_u(j) H(\bar{S}_{c(u)} | S_u = j, \bar{X}_u = \bar{x}_u) - H(S_u | S_{p(u)}, X = x),$$

but the computation of $H(S_u | S_{p(u)}, X = x)$ using (17) is still necessary.

In summary, the profile of partial subtree entropies $(H(\bar{S}_{c(u)} | S_u = j, \bar{X}_u = \bar{x}_u))_{u \in \mathcal{U};\, j = 0, \ldots, J-1}$
is firstly computed using (25). The profile of partial state tree entropies $(H(\bar{S}_u | X = x))_{u \in \mathcal{U}}$
is deduced from these entropies and the smoothed probabilities, using (26). Computation
of partial state tree entropies $(H(\bar{S}_{0 \setminus u} | X = x))_{u \in \mathcal{U}}$ and conditional entropies $(H(S_u | S_{p(u)}, X = x))_{u \in \mathcal{U}}$
still relies essentially on (23) and (17), although variant (27) remains possible. The time
complexity of the algorithm is in $O(J^2 n)$.

Entropy profiles conditioned on the children states in HMT models. Up to this point,
the proposed profile of conditional entropies has the property that global state tree entropy is
the sum of conditional entropies. This is a consequence of Corollary 1, which translates into
HMT models by profiles of state entropy given the parent state.

However, as will be shown in Section 4 (Application), the state uncertainty at vertex u may
be better explained by the values of children states than that of the parent state in practical
situations. Consequently, profiles based on $H(S_u | S_{c(u)}, X = x)$ have practical importance and
are derived below. Since $S_u$ is conditionally independent from $\{\bar{S}_{c(v)}\}_{v \in c(u)}$ given $S_{c(u)}$ and X,
we have $H(S_u | \bar{S}_{c(u)}, X = x) = H(S_u | S_{c(u)}, X = x)$. This quantity, bounded from above by the
marginal entropy $H(S_u | X = x)$, is computed as follows:

$$H(S_u | S_{c(u)}, X = x) = -\sum_j \sum_{s_{c(u)}} P(S_u = j, S_{c(u)} = s_{c(u)} | X = x) \log P(S_u = j | S_{c(u)} = s_{c(u)}, X = x),$$

with

$$P(S_u = j, S_{c(u)} = s_{c(u)} | X = x) = \xi_u(j) \prod_{v \in c(u)} P(S_v = s_v | S_u = j, X = x) \qquad (28)$$
$$= \xi_u(j) \prod_{v \in c(u)} \frac{\beta_v(s_v) p_{j s_v}}{P(S_v = s_v) \beta_{u,v}(j)}$$

from equation (18), and where equation (28) comes from conditional independence of $\{S_v\}_{v \in c(u)}$
given $S_u$. The quantities $\beta_v(k)$, $\beta_{p(v),v}(j)$ and $P(S_v = k)$ are directly extracted from the upward
recursion in Section 3.2. Consequently,

$$P(S_u = j | S_{c(u)} = s_{c(u)}, X = x) = \frac{\xi_u(j) \prod_{v \in c(u)} [p_{j s_v} / \beta_{u,v}(j)]}{\sum_k \xi_u(k) \prod_{v \in c(u)} [p_{k s_v} / \beta_{u,v}(k)]}.$$

Note that the time complexity of the algorithm for computing entropy profiles conditioned on
the children states is in $O(J^{c+1} n)$ in the case of c-ary trees. This makes it the only algorithm
among those in this article whose complexity is not in $O(J^2 n)$.

The profiles based on $H(S_u | S_{c(u)}, X = x)$ satisfy the following property:

Proposition 2

$$H(S | X = x) \leq \sum_u H(S_u | S_{c(u)}, X = x)$$

where $H(S_u | S_{c(u)}, X = x)$ must be understood as $H(S_u | X = x)$ if u is a leaf vertex.

Thus, these entropies cannot be interpreted as the local contribution of vertex u to global state
tree entropy $H(S | X = x)$, unless equality is obtained in the above equation (for example, if $\mathcal{T}$
is a linear tree, or in other words a sequence). To assess the difference between the right-hand
and the left-hand parts of the above inequality in practical situations, numerical experiments are
performed in Section 4 (Application).
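
The inequality of Proposition 2, and the equality case for a linear tree, can be checked numerically on toy models. The sketch below relies on exhaustive enumeration of the state trees rather than on the recursions of this section; the function names and all parameter values are hypothetical.

```python
import itertools
import math


def posterior(parent, pi, P, B, x):
    """P(S = s | X = x) for every state tree s, by enumeration (toy sizes only)."""
    n, J = len(parent), len(pi)
    joint = {}
    for s in itertools.product(range(J), repeat=n):
        p = pi[s[0]] * B[s[0]][x[0]]
        for u in range(1, n):
            p *= P[s[parent[u]]][s[u]] * B[s[u]][x[u]]
        joint[s] = p
    total = sum(joint.values())
    return {s: p / total for s, p in joint.items()}


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def children_conditioned_sum(parent, post):
    """Sum over u of H(S_u | S_c(u), X = x), with H(S_u | X = x) at the leaves."""
    n = len(parent)
    children = [[] for _ in range(n)]
    for u in range(1, n):
        children[parent[u]].append(u)
    total = 0.0
    for u in range(n):
        joint, kids = {}, {}
        for s, p in post.items():
            c = tuple(s[v] for v in children[u])
            joint[(s[u], c)] = joint.get((s[u], c), 0.0) + p
            kids[c] = kids.get(c, 0.0) + p
        # H(S_u | S_c(u)) = H(S_u, S_c(u)) - H(S_c(u)); at a leaf, kids == {(): 1}.
        total += entropy(joint.values()) - entropy(kids.values())
    return total


pi = [0.6, 0.4]
P = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]

# Branching tree: the sum is an upper bound on the global state tree entropy.
tree_parent, tree_x = [None, 0, 0, 1, 1], [0, 1, 0, 1, 1]
tree_post = posterior(tree_parent, pi, P, B, tree_x)
tree_sum = children_conditioned_sum(tree_parent, tree_post)
tree_entropy = entropy(tree_post.values())

# Linear tree (a sequence): the bound is attained.
chain_parent, chain_x = [None, 0, 1, 2], [0, 1, 1, 0]
chain_post = posterior(chain_parent, pi, P, B, chain_x)
chain_sum = children_conditioned_sum(chain_parent, chain_post)
chain_entropy = entropy(chain_post.values())
```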

A proof of Proposition 2 is given in Appendix A.2. A consequence of this inequality is that
factorization

$$P(S_u = j, \bar{S}_{c(u)} = \bar{s}_{c(u)} | X = x) = P(S_u = j | \bar{S}_{c(u)} = \bar{s}_{c(u)}, X = x) P(\bar{S}_{c(u)} = \bar{s}_{c(u)} | X = x)$$

cannot be pursued through a recursion on the children of u. Essentially, this comes from the fact
that any further factorization based on conditional independence between the $(\bar{S}_v)_{v \in c(u)}$ must
involve $S_u$.



4 Applications of entropy profiles 

To illustrate the practical ability of entropy profiles to provide localized information on the state 
sequence uncertainty, two cases of application are considered. The first case consists of the HMC 
analysis of the earthquake dataset, published by Zucchini & MacDonald (2009). The second case 
consists of the HMT analysis of the structure of pine branches, using an original dataset. It is 
shown in particular that entropy profiles allow regions that are non-ambiguously explained by 
the estimated model to be differentiated from regions that are ambiguously explained. Their 
ability to provide accurate interpretation of the model states is also emphasized. 
4.1 HMC analysis of earthquakes

The data consist of a single sequence of annual counts of major earthquakes (defined as of
magnitude 7 and above) for the years 1900-2000; see Figure 2.



Figure 2: Earthquake data: Restored state sequence represented as step functions, the level of the
segments being either the parameter $\lambda_j$ of the Poisson observation distributions corresponding to
the restored state j or the empirical mean estimated for the segment. (Horizontal axis: year, 1900
to 2000.)



A 3-state stationary HMC model with Poisson observation distributions was estimated on the
basis of this earthquake count sequence and the estimated parameters of the Poisson observation
distributions were $\lambda_1 = 13.1$, $\lambda_2 = 19.7$ and $\lambda_3 = 29.7$. The restored state sequence is represented
in Figure 2 as step functions, the level of the segments being either the parameter $\lambda_j$ of the
Poisson observation distributions corresponding to the restored state j or the empirical mean
estimated for the segment. The state profiles computed by the forward-backward algorithm
$\{P(S_t = j | X = x);\ j = 0, \ldots, J - 1;\ t = 0, \ldots, T - 1\}$ are shown in Figure 3. The entropy of
the state sequence that explains the observed sequence for the estimated HMC model is bounded
from above by the sum of the marginal entropies

$$H(S_0^{T-1} | X = x) = \sum_t H(S_t | S_{t-1}, X = x) = 14.9 < \sum_t H(S_t | X = x) = 19.9.$$
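
The two sums compared above can be obtained from a forward-backward pass. The sketch below is a generic illustration and does not reproduce the values 14.9 and 19.9: the Poisson parameters are those quoted in the text, but the transition matrix, the initial distribution and the short count sequence are hypothetical placeholders, since the estimated transition probabilities and the data are not listed here.

```python
import math


def poisson_pmf(lam, k):
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def hmc_entropy_profiles(pi, P, lambdas, counts):
    """Profiles H(S_t | S_{t-1}, X = x) and H(S_t | X = x) for a Poisson HMC."""
    J, T = len(pi), len(counts)
    b = [[poisson_pmf(lam, c) for lam in lambdas] for c in counts]

    # Forward recursion: F[t][j] = P(S_t = j | x_0..x_t),
    # G[t][j] = P(S_t = j | x_0..x_{t-1}) (one-step prediction, G[0] = pi).
    F, G = [], [list(pi)]
    for t in range(T):
        w = [G[t][j] * b[t][j] for j in range(J)]
        tot = sum(w)
        F.append([a / tot for a in w])
        G.append([sum(F[t][i] * P[i][j] for i in range(J)) for j in range(J)])

    # Backward recursion: smoothed marginals Ls[t][j] = P(S_t = j | X = x)
    # and entropies conditional on the past.
    Ls = [None] * T
    Ls[T - 1] = F[T - 1]
    cond = [0.0] * T
    for t in range(T - 2, -1, -1):
        # pair[i][j] = P(S_t = i, S_{t+1} = j | X = x)
        pair = [[F[t][i] * P[i][j] * Ls[t + 1][j] / G[t + 1][j] for j in range(J)]
                for i in range(J)]
        Ls[t] = [sum(row) for row in pair]
        cond[t + 1] = -sum(pair[i][j] * math.log(pair[i][j] / Ls[t][i])
                           for i in range(J) for j in range(J) if pair[i][j] > 0)
    marginal = [-sum(q * math.log(q) for q in row if q > 0) for row in Ls]
    cond[0] = marginal[0]                 # no past at t = 0
    return cond, marginal


lambdas = [13.1, 19.7, 29.7]              # Poisson parameters quoted in the text
pi = [1 / 3, 1 / 3, 1 / 3]                # hypothetical initial distribution
P = [[0.90, 0.08, 0.02],                  # hypothetical transition matrix
     [0.10, 0.80, 0.10],
     [0.05, 0.15, 0.80]]
counts = [13, 14, 20, 26, 31, 18, 12]     # hypothetical counts, not the real series

cond, marginal = hmc_entropy_profiles(pi, P, lambdas, counts)
```

`sum(cond)` is the entropy of the whole state sequence given the observations, and it cannot exceed `sum(marginal)`; the gap measures how much uncertainty is overstated when the dependency structure is ignored.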

For this example, we chose to show the entropies conditional on the past, which are the only
meaningful conditional entropies. Since log J is an upper bound on $H(S_t | X = x)$, the scale of
these entropy profiles is in theory [0, log 3]. However, the scale of the entropy profiles is rather
[0, log 2], since in practice at most two states can explain a given observation equally well; see
Figure 4.

Ignoring the dependency structure within the model to assess state uncertainty leads to strong
overestimation of this uncertainty. This is highlighted in Figure 4 by the comparison of the profile
of entropies conditional on the past and the profile of marginal entropies, and in Figure 5 by
the comparison of the profile of partial state sequence entropies and the profile of cumulative
marginal entropies. It should be recalled that the marginal entropy profile is a direct summary
of the uncertainty reflected in the smoothed probability profiles shown in Figure 3. Hence, such
profiles should be interpreted with caution.




Figure 4: Earthquake data: Profiles of entropies conditional on the past and of marginal entropies.
(Curves: conditional entropy, marginal entropy; horizontal axis: year.)

Figure 5: Earthquake data: Profiles of partial state sequence entropies and of cumulative marginal
entropies.



4.2 Analysis of the structure of Aleppo pines 

The aim of this study was to provide a model of the architecture of Aleppo pines. The data 
set is composed of seven branches of Aleppo pines (Pinus halepensis Mill., Pinaceae) planted
in the south of France (Clapiers, Herault). The branches come from seven different individuals
aged between 35 and 40 years. They were described at the scale of annual shoot, defined as the
segment of stem established within a year. Five variables were recorded for each annual shoot: 
length (in cm), number of branches per tier, number of growth cycles and presence or absence of 
female cones and of male cones. During a year, the growth of an annual shoot can occur in one 
to three cycles. An annual shoot with several growth cycles is said to be polycyclic. The number 
of growth cycles beyond the first one corresponds to the third recorded variable. On these seven 
branches, a total of 836 annual shoots was measured. 

4.2.1 Competing models 

An HMT model was estimated on the basis of the seven branches, to identify classes of annual
shoots with comparable values for the variables, and to characterize the succession of the classes
within the branches. The branches were considered as mutually independent random realizations of
the same HMT model. The emission distributions were multinomial distributions $\mathcal{M}(1; p_1, \ldots, p_V)$
for each variable but the length variable, where V denotes the number of possible values for this
variable. The length variable, if included in the model, was assumed to follow a negative binomial
distribution, given the state. The five variables were assumed independent given the state. The
number of HMT states could not be deduced a priori from biological arguments, so it had to be
determined using statistical criteria. We resorted to the Bayesian Information Criterion (BIC) to
select this number. Although the consistency of BIC was proved for a restricted family of HMC
models only (see Boucheron and Gassiat, 2007), its practical ability to provide useful results is
established (see e.g. Celeux and Durand, 2008). The maximal number of possible states was
set to 10. For HMT models where the length variable was discarded, BIC selected a 5-state or
a 6-state model (with respective values of BIC -2,047 and -2,039). The third best model had 4
states, with a BIC value of -2,074. In the case of models including the length variable, a 6-state
model was selected (with a BIC value of -10,541) followed by 4-state and 5-state models (with
respective values of BIC -10,545 and -10,558). Note that since the estimated HMT models were
not ergodic, the theoretical properties of BIC are not established.

4.2.2 Entropy profiles in the 5-state HMT model without length variable 

The estimated transition matrix of the 5-state HMT model is

[estimated transition matrix P]

and the Markov tree is initialized in state 0 with probability 1. It can be seen from P that the
Markov tree has transient states 0 and 1 and an absorbing class {2; 3; 4}, in which the states
alternate quasi systematically.

Female cones are potentially present in state 0 only (in state 0, a shoot has female cones with
probability 0.14). Male cones are potentially present in state 4 only (a shoot has male cones with
probability 0.66). Besides, state 0 is characterized by a high branching intensity (0 to 8 branches)
and frequent polycyclism (a shoot is polycyclic with probability 0.95). State 1 is characterized
by intermediate branching intensity (0 to 3 branches, unbranched with probability 0.67) and
monocyclism. State 2 is characterized by intermediate branching intensity (0 to 4 branches,
unbranched with probability 0.81) and rare polycyclism (a shoot is polycyclic with probability
0.06). States 3 and 4 are always monocyclic, and are mostly unbranched (with probability 0.94
and 0.98, respectively). As a consequence, any unbranched, monocyclic, sterile shoot can be in
any of the 5 states (respectively with probability 0.002, 0.248, 0.281, 0.346 and 0.123).
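
As a rough indication of how ambiguous such a shoot is when taken in isolation, the entropy of the five probabilities quoted above can be compared with its upper bound log 5. The sketch below takes those probabilities at face value as a distribution over the states; it ignores the tree context, which is precisely what the entropy profiles add.

```python
import math

# State probabilities quoted above for an unbranched, monocyclic, sterile shoot
# in the 5-state model (states 0 to 4).
probs = [0.002, 0.248, 0.281, 0.346, 0.123]

entropy = -sum(p * math.log(p) for p in probs if p > 0)
max_entropy = math.log(len(probs))   # entropy of a uniform distribution on 5 states

print(f"entropy = {entropy:.2f} nats, upper bound log 5 = {max_entropy:.2f} nats")
```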

From a biological point of view, this model highlights a gradient of vigour, since the states 
are ordered by decreasing number of growth cycles and branches. This also predicts that class
{2; 3; 4} is composed of sterile shoots that have potential polycyclism, alternating with sterile
monocyclic shoots, and finally shoots with potential male sexuality.

In the dataset, shoots with male cones (referred to as male shoots hereafter) systematically 
follow sterile shoots. Moreover, they are either located at the tip of a branch, or followed by a 
unique sterile shoot. This is a consequence of a particular measurement protocol for this dataset, 
in which individuals were measured just after the occurrence of the first male cones. In contrast, 
the infinite alternation of two sterile shoots and one male shoot predicted by this model cannot 
be considered as a general pattern in the pine architecture. A more relevant hypothesis is that 
after several years of growth, only unbranched monocyclic sterile shoots are produced (or maybe 
a mixture of both such male and sterile shoots). 

To analyze how state ambiguity due to unbranched, monocyclic, sterile shoots affects state
restoration, entropy profiles were computed for each branch. Firstly, the annual shoots were
represented using a colormap, which is a mapping between colours and the values of conditional
entropies $H(S_u | S_{p(u)}, X = x)$ (see Figure 6a)). Vertices with lowest conditional entropy are
represented in blue, whereas those with highest conditional entropy are in red. In a similar way,
the marginal entropy could also be represented using a colormap.

The most likely state tree for each individual was computed using the Viterbi algorithm
for HMT models (Durand et al., 2004). This state tree is represented in Figure 6b). This
representation shows where the states are located within the tree; for example state 0 is located
on the main axis (main stem) and at the base of lateral axes. Moreover, in conjunction with
Figure 6a), it highlights some states for which the restoration step is not very ambiguous (in our
example, state 0, and to a lesser extent, state 4). Thus, these states with low entropy correspond
to vertices with the highest number of branches, female or male cones. On the contrary, the
vertices with highest entropy are mostly unbranched, monocyclic and sterile, and are located at
peripheral parts of the plant.

Figure 6: Conditional entropy and state tree restoration for a given branch. a) Conditional
entropy $H(S_u | S_{p(u)}, X = x)$ using a colormap. Blue corresponds to lowest entropy and red to
highest entropy. b) State tree restoration. The correspondence between states and colours is as
follows: state 0 - green; state 1 - red; state 2 - blue; state 3 - yellow; state 4 - magenta.

Using the conditional entropy in Figure 6a), peripheral vertices with maximal or minimal
conditional entropy can be selected. To further interpret the model with respect to the data,
entropy profiles were computed along paths leading to these vertices. These profiles were
complemented by so-called upward-downward Viterbi profiles. These profiles rely on the following
quantities

$$\max_{(s_v)_{v \neq u}} P((S_v = s_v)_{v \neq u}, S_u = j | X = x),$$

for each state j and each vertex u of the tree. Their computation is based on upward and
downward dynamic programming recursions, similar to those of Brushe et al. (1998), and is not
detailed in this paper. Such profiles provide an overview of local alternatives to the state tree
restoration given by the Viterbi algorithm. They were used by Guédon (2007) as diagnostic tools
for localization of state uncertainty in the context of hidden (semi-)Markov chains. A detailed
analysis of the state uncertainty is provided by the entropy profiles.

Female shoots. To illustrate how entropy reduction and Viterbi profiles are connected, an
example consisting of a path containing a female shoot is considered. This path corresponds
to the main axis of the third individual (for which $H(S | X = x) = 52.9$). The path contains 6
vertices, referred to as {0, . . . , 5}. The female shoot is at vertex 2, and vertex 3 is a bicyclic shoot.
Since a female shoot necessarily is in state 0, $H(S_2 | X = x) = 0$ (no uncertainty). Since state 0 is
quasi systematically preceded by state 0, shoots 0 and 1 are in state 0 with a very high probability
and again, $H(S_u | X = x) \approx 0$ for u = 0, 1. Shoot 3 is bicyclic, and thus is in state 0 with a
very high probability ($H(S_3 | X = x) \approx 0$). Shoots 4 and 5, as unbranched, monocyclic, sterile
shoots, can be in any state. However, due to several impossible transitions in matrix P, only
the following four configurations have non-negligible probabilities for $(S_4, S_5)$: (2, 3), (1, 1), (1, 2)
and (3, 4). This is partly highlighted in Figure 7c) by the Viterbi profile, and results in high
mutual information between $S_4$ and $S_5$ given X = x. For example, $P(S_5 = 3 | S_4 = 2, X = x)$,
$P(S_5 = 4 | S_4 = 3, X = x)$ and $P(S_5 \in \{1, 2\} | S_4 = 1, X = x)$ are very close to 1. Thus the
downward conditional entropy $H(S_5 | \bar{S}_{0 \setminus 5}, X = x) = H(S_5 | S_4, X = x)$ is 0.1, whereas
$H(S_5 | X = x) = 0.8$. Similarly, the upward conditional entropy $H(S_4 | \bar{S}_{c(4)}, X = x) =
H(S_4 | S_5, X = x)$ is 0.5 whereas $H(S_4 | X = x) = 1.1$; see both entropy profiles in
Figure 7a) and b). Since there practically is no uncertainty on the value of $S_3$, the mutual
information between $S_3$ and $S_4$ given X = x is very low.

Using equation (21), the contribution of the vertices of the considered path $\mathcal{P}$ to the global
state tree entropy can be computed as:

$$\sum_{u \in \mathcal{P}} H(S_u | S_{p(u)}, X = x)$$

and is equal to 1.24 in the above example (that is, 0.21 per vertex on average). The global state
tree entropy for this individual is 0.37 per vertex, against 0.38 per vertex in the whole dataset.

The contribution of $\mathcal{P}$ to the global state tree entropy corresponds to the sum of the heights
of every point of the profile of entropy given parent state in Figure 7b). The mean marginal
state entropy for this individual is 0.44 per vertex, which strongly overestimates the mean state
tree entropy.

4.2.3 Entropy profiles in the 6-state HMT model without length variable 

To assess the ability of the 5-state and the 6-state HMT models to provide state restorations 
with low uncertainty and relevant interpretation of the results, both models are compared using 
entropy and Viterbi profiles. 

The estimated transition matrix of the 6-state HMT model without the "length" variable is

[estimated transition matrix P]

and the Markov tree is initialized in state 0 with probability 1. It can be seen from P that the
Markov tree has transient states 0 and 1 and an absorbing class {2; 3; 4; 5}. Any return from
state 3, 4 or 5 to state 2 is actually rare, and states 3 to 5 alternate most of the time.

Figure 7: Entropy profiles along a path containing a female shoot, obtained with a 5-state HMT
model without the "length" variable. a) Marginal and conditional entropy given children states.
b) Marginal and conditional entropy given parent state. c) State tree restoration with the Viterbi
upward-downward algorithm.

Female cones are potentially present in state 0 only (a shoot has female cones with probability
0.22 in state 0). Male cones are potentially present in state 5 only (a shoot has male cones with
probability 0.62). Besides, state 0 is characterized by a high branching intensity (0 to 8 branches)
and frequent polycyclism (a shoot is polycyclic with probability 0.92). State 1 is characterized by
low branching intensity (0 to 2 branches, unbranched with probability 0.56) and monocyclism.
State 2 is characterized by intermediate branching intensity (0 to 6 branches, unbranched with
probability 0.21) and bicyclism (a shoot is bicyclic with probability 0.99). States 3 to 5 are always
monocyclic, and are mostly unbranched (with probability 0.87, 0.94 and 0.98, respectively). As
a consequence, any unbranched, monocyclic, sterile shoot can be in any of the states 0, 1, 3,
4 and 5 (respectively with probability 0.003, 0.205, 0.316, 0.342 and 0.134). States 1, 3 and
4 have rather similar characteristics, although they slightly differ by their branching densities.
These states are essentially justified by their particular positions in the plant. The role of state
4 is mainly to represent the state-transition pattern 345, composed of two sterile shoots and one
male shoot. In usual Markovian modelling, a binary pattern 001 for the "male cone" variable could
be modeled by a second-order Markov model, or by a semi-Markov model with Bernoulli sojourn
time in value 0. Here, since a first-order Markov tree is considered, state 4 may be thought of as
an additional necessary state to represent this pattern.

From a biological point of view, the approximate reduction of the number of growth cycles 
and branches along the states is relevant. However, an absorbing class where two sterile shoots 
and one male shoot tend to indefinitely alternate does not seem justified. 

The global state entropy on the whole dataset is 0.36 per vertex on average. This quantity 
is slightly less than that of the 5-state HMT model. However, the entropy can increase locally 
on some particular paths. 

Female shoots. An example consisting of the same branch and path as in Section 4.2.2 is
considered (branch with a female shoot). Let us recall that the female shoot is at vertex 2, and
vertex 3 is a bicyclic shoot. As in the case of a 5-state model, there is not much uncertainty on
the state values at vertices 0 to 3. Only three configurations have non-negligible probabilities for
$(S_4, S_5)$: (3, 4), (4, 5) and (3, 3). The last two configurations are at most 4 times less likely than
the most likely configuration. As a consequence, the number and probabilities of the suboptimal
state trees are lower for the 6-state model than for the 5-state model (see Figures 7c) and 8c)),
and the values in the downward entropy profile are also lower (see Figures 7b) and 8b)).




Figure 8: Entropy profiles along a path containing a female shoot, obtained with a 6-state HMT
model without the "length" variable. a) Marginal and conditional entropy given children states.
b) Marginal and conditional entropy given parent state. c) State tree restoration with the Viterbi
upward-downward algorithm (legend: maximum posterior state probabilities).

The global state tree entropy for this individual is 0.35 per vertex, and the contribution of the
considered path to the global state entropy is 0.17 per vertex, which is lower than for a 5-state
model (i.e. 0.21 per vertex).

4.2.4 Entropy profiles in the 6-state HMT model with length variable 

The estimated transition matrix of the 6-state HMT model with the "length" variable is

[estimated transition matrix P]

The hidden states and the state transitions, represented in the accompanying figure, have the
following interpretation. The Markov tree is initialized in state 0 with probability 1. It can be seen
from P that the Markov tree has transient state 0, transient class {1, 2, 3}, and two absorbing
states 4 and 5. The only possible transitions to a previously-visited state are 2 → 1, 3 → 1 and
3 → 2.

The states are ordered by decreasing length, except for state 5, which has slightly longer
shoots than state 4. Female cones are potentially present in state 0 only (a shoot has female
cones with probability 0.13 in state 0). Male cones are potentially present in state 4 essentially,
and any shoot in state 4 has male cones with probability 1. Male cones may also be present in
states 0 and 5 (with probability 0.02 and 0.03, respectively). Besides, state 0 is characterized by
a high branching intensity (0 to 8 branches) and frequent polycyclism (a shoot is polycyclic with
probability 0.89). State 1 is characterized by intermediate branching intensity (1 to 3 branches, 
never unbranched) and monocyclism with rare bicyclism (a shoot is monocyclic with probability 
0.96). State 2 is characterized by low branching intensity (0 to 3 branches, unbranched with 
probability 0.74) and monocyclism with rare bicyclism (a shoot is monocyclic with probability 
0.9). States 3 to 5 are always monocyclic, and are mostly unbranched (with probability 0.94 and 
0.98, respectively). As a consequence, any unbranched, monocyclic, sterile shoot can be in any 
of the states 0, 2, 3 and 5 (respectively with probability 0.001, 0.261, 0.367 and 0.371). This 
characteristic of the model will be shown to be the source of state uncertainty for such shoots. 
States 3 and 5 differ mostly by their shoot length distributions. 
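
As a rough illustration of how much uncertainty the four probabilities quoted above carry on their own, the following minimal sketch computes their Shannon entropy and compares it with the maximum possible for four states. It uses natural logarithms, which is an assumption (the unit of the entropies reported in this section is not restated here), and the helper name `shannon_entropy` is hypothetical.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in nats of a discrete distribution; zero terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# State probabilities quoted in the text for an unbranched, monocyclic,
# sterile shoot (states 0, 2, 3 and 5, in that order).
sterile_shoot = [0.001, 0.261, 0.367, 0.371]

h = shannon_entropy(sterile_shoot)
h_max = math.log(len(sterile_shoot))  # entropy of a uniform law on 4 states
print(f"entropy = {h:.3f} nats, maximum for 4 states = {h_max:.3f} nats")
```

Since state 0 is almost excluded, the value is close to the entropy of a uniform distribution over the three remaining states, which is consistent with the statement that the observed variables alone say little about the state of such shoots.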

From a biological point of view, this model highlights a gradient of vigor, since the states
are ordered by decreasing length, and also roughly by decreasing number of growth cycles and
branches. The existence of an absorbing class corresponding to unbranched, monocyclic shoots
of short length (either male or sterile) predicted by this estimated HMT model is more consistent
with biological a priori knowledge on Aleppo pine architecture than the models in Sections 4.2.2
and 4.2.3.

A detailed analysis of state uncertainty has been performed on three paths (extracted from
two distinct individuals), chosen for the contrasting situations they yield:

Case 1) Female shoots Firstly, the same path containing a female shoot as in Sections 4.2.2
and 4.2.3 is considered. Let us recall that the female shoot is at vertex 2, and vertex 3 is a
bicyclic shoot. Since a female shoot is necessarily in state 0, $H(S_2 \mid X = x) = 0$ (no uncertainty).
Since state 0 is systematically preceded by state 0, shoots 0 and 1 are in state 0 with probability
one and again, $H(S_u \mid X = x) = 0$ for $u = 0, 1$. Shoot 3 is bicyclic, and thus is in state 0 with a
very high probability ($H(S_3 \mid X = x) \approx 0$). Shoots 4 and 5, as unbranched, monocyclic, sterile
shoots, can be in any state except states 1 and 4. However, due to several impossible transitions
in matrix P, and given the lengths of these shoots, only the following three configurations have
non-negligible probabilities for $(S_4, S_5)$: (5, 5), (2, 3) and (3, 5). This is partly highlighted in
Figure 10 c) by the Viterbi profile. As a consequence, $S_5$ can be deduced from $S_4$, which results
in high mutual information between $S_4$ and $S_5$ given $X = x$. Thus the conditional entropy
$H(S_5 \mid S_{p(5)}, X = x) = H(S_5 \mid S_4, X = x)$ is 0.02, whereas $H(S_5 \mid X = x) = 0.46$. Similarly, the
conditional entropy $H(S_4 \mid S_{c(4)}, X = x) = H(S_4 \mid S_5, X = x)$ is 0.02, whereas $H(S_4 \mid X = x) = 0.46$,
as illustrated by both entropy profiles in Figure 10 a) and b). Since there is practically no
uncertainty on the value of $S_3$, the mutual information between $S_3$ and $S_4$ given $X = x$ is very
low.

Figure 9: 6-state HMT model: transition diagram and symbolic representation of the state
signatures (conditional mean values of the variables given the states, depicted by typical shoots).
Dotted arrows correspond to transitions with associated probability < 0.1. Mean shoot lengths
given each state are proportional to segment lengths, except for state 0 (whose mean length is
slightly more than twice the mean length for state 1).
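
For the pair of shoots 4 and 5 in Case 1, the "high mutual information" can be quantified directly from the marginal and conditional entropies reported above. The following worked equation is only a sketch using the rounded values from the text; the notation $I(\cdot\,;\cdot \mid X = x)$ for conditional mutual information is introduced here for illustration.

```latex
\begin{aligned}
I(S_4 ; S_5 \mid X = x)
  &= H(S_5 \mid X = x) - H(S_5 \mid S_4, X = x) = 0.46 - 0.02 = 0.44, \\
  &= H(S_4 \mid X = x) - H(S_4 \mid S_5, X = x) = 0.46 - 0.02 = 0.44 .
\end{aligned}
```

Both expressions agree, as they must by symmetry of mutual information: almost all of the marginal uncertainty on each of the two states is shared with the other.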

The contribution of the vertices of the considered path V to the global state tree entropy is 
equal to 0.48 in the above example (that is, 0.08 per vertex on average), which is far less than for 
both models without the "length" variable. The global state tree entropy for this individual is 
0.21 per vertex, against 0.20 per vertex in the whole dataset. This illustrates that incorporating 
the length variable into the HMT model strongly reduces uncertainty on the state trees. The 
mean marginal state entropy for this individual is 0.37 per vertex, which strongly overestimates 
the mean state tree entropy. 

Figure 10: Entropy profiles along a path containing a female shoot. a) Marginal and conditional
entropy given children states, b) Marginal and conditional entropy given parent state, c) State
tree restoration with the Viterbi upward-downward algorithm.

Case 2) Sterile shoots Then, focus is put on a path essentially composed of monocyclic,
sterile shoots in the fourth individual (for which $H(S \mid X = x) = 47.5$). The path contains 5
vertices, referred to as {0, ..., 4}. Shoots 0 and 1 are long and highly branched, and thus are in
state 0 with probability ≈ 1 (also, shoot 0 is bicyclic). Shoots 2 to 4 are monocyclic and sterile.
Shoots 2 and 3 bear one branch, and can be in states 1 or 2 essentially. Shoot 4 is unbranched
and, from the Viterbi profile in Figure 11 c), it can be in states 2, 3 or 5. This is summarized
by the entropy profile in Figure 11 b). Since there is no uncertainty on $S_1$,
$H(S_2 \mid S_1, X = x) = H(S_2 \mid X = x)$, as shown in Figure 11 b). Moreover, from the Viterbi profile,
only the following three configurations for $(S_2, S_3)$ have non-negligible probabilities: (2, 1), (1, 1)
and (2, 2), and $S_2 = 2$ has highest probability. Since $S_3$ cannot be deduced from $S_2 = 2$,
$H(S_3 \mid S_2, X = x)$ is rather high. Similarly, only the following three configurations for $(S_3, S_4)$
have non-negligible probabilities: (1, 5), (1, 2) and (2, 3), and $S_3 = 2$ has low probability, so that
$H(S_4 \mid S_3, X = x)$ is rather high.

The profile $H(S_u \mid S_{c(u)}, X = x)$ in Figure 11 a) is interpreted as follows: the marginal entropy
of $S_2$ is high (0.61), and $S_2$ cannot be deduced from $S_3$. However, $S_2$ can be deduced from the
state of a brother vertex of vertex 3 (written $S_{3'}$ here), since $S_{3'} = 3$ implies $S_2 = 2$ and
$S_{3'} = 5$ implies $S_2 = 1$ (as would be shown by entropy profiles including $S_{3'}$). Hence,
$H(S_2 \mid S_{c(2)}, X = x)$ is low. This results in high mutual information between $S_2$ and its
children states given $X = x$, as illustrated in the profile of Figure 11 d).

The contribution of this path to the global state tree entropy is 1.41 (that is, 0.28 per vertex
on average), which is higher than the contribution of the path containing a female cone considered
above. This is also higher than the mean contribution in the whole branch (that is, 0.24 per
vertex). This is explained by the lack of information brought by the observed variables (several
successive sterile monocyclic shoots, which can be in states 1, 2, 3 or 5). The mean marginal state
entropy for this individual is 0.37 per vertex, which strongly overestimates the mean state tree
entropy. Note that the representation of state uncertainty using profiles of smoothed probabilities
induces a perception of global uncertainty on the states along V equivalent to that provided by
marginal entropy profiles. The discrepancy between the profile of partial state entropies along
V and the profile of cumulative marginal entropies is highlighted in Figure 11 e).

Figure 11: Entropy profiles along a path containing mainly sterile monocyclic shoots. a) Marginal
and conditional entropy given children states, b) Marginal and conditional entropy given parent
state, c) State tree restoration with the Viterbi upward-downward algorithm, d) Mutual
information between a state and its children states, e) Profiles of partial state sequence and of
cumulative marginal entropies.



Case 3) Male shoots Finally, a path with a terminal male shoot, included in the fourth
individual, is analyzed. The path contains 5 vertices, referred to as {0, ..., 4}. Shoots 0 and
1 are long and highly branched, and thus must be in state 0 (also, shoot 0 is bicyclic). Thus,
$H(S_u \mid X = x) = 0$ for $u = 0, 1$. Shoot 2 is long and unbranched, and thus must be in state 2.
Shoot 3 bears one branch, and can be in states 1 or 2 essentially (since $S_2 = 2$ and the associated
transition probability in P is low). As a male shoot, shoot 4 is in state 4 with a very high
probability, or in state 5 otherwise, and $H(S_4 \mid X = x) = 0.08$. Moreover, $S_3 = 3$ if and only if
$S_4 = 4$, thus $H(S_4 \mid S_3, X = x) = 0 = H(S_3 \mid S_4, X = x)$.

Finally, the contribution of this path to the total entropy is 0.09 (i.e. 0.02 per vertex on
average), which is negligible. This result is typical of male shoots, because they are mainly in
state 4 and state 4 can only be reached from state 3.
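
The per-vertex figures quoted for the three paths can be cross-checked with a few lines of arithmetic. The sketch below assumes that the path of Case 1 has 6 vertices (shoots 0 to 5 are mentioned in the text) and that the paths of Cases 2 and 3 have 5 vertices each, as stated; the dictionary labels are illustrative only.

```python
# (total contribution of the path to the state tree entropy, number of vertices)
paths = {
    "case 1, female shoot": (0.48, 6),    # vertices 0..5 (assumed from the text)
    "case 2, sterile shoots": (1.41, 5),  # vertices 0..4
    "case 3, male shoot": (0.09, 5),      # vertices 0..4
}

# Average contribution per vertex along each path.
per_vertex = {name: total / n for name, (total, n) in paths.items()}

for name, value in per_vertex.items():
    print(f"{name}: {value:.2f} per vertex")
```

The ordering of the three cases (sterile shoots highest, male shoot lowest) follows directly from these averages.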

Figure 12: Entropy profiles along a path with a terminal male shoot. a) Marginal
and conditional entropy given children states, b) Marginal and conditional entropy given parent
state, c) State tree restoration with the Viterbi upward-downward algorithm.



4.2.5 Comparison between entropy profiles conditioned on parent or children states 

As discussed in Section 3, the following inequality is satisfied, regarding entropy profiles:

$$G(\mathcal{T}) = \sum_{u \in \mathcal{T}} H(S_u \mid S_{p(u)}, X = x) \le M(\mathcal{T}) = \sum_{u \in \mathcal{T}} H(S_u \mid X = x),$$

that is, the global state tree entropy is bounded from above by the sum of marginal entropies.
Let $C(\mathcal{T})$ be defined as

$$C(\mathcal{T}) = \sum_{u \in \mathcal{T}} H(S_u \mid S_{c(u)}, X = x).$$

On the one hand, we have $C(\mathcal{T}) \le M(\mathcal{T})$. On the other hand, by Proposition 2,
$G(\mathcal{T}) \le C(\mathcal{T})$. To assess the overestimation of state uncertainty induced by using the
profiles based on $H(S_u \mid S_{c(u)}, X = x)$ or $H(S_u \mid X = x)$ instead of $H(S_u \mid S_{p(u)}, X = x)$, these
quantities were computed for each tree in the dataset, using the 6-state HMT model with the
"length" variable given in Section 4.2.4. The ratios $(C(\mathcal{T}) - G(\mathcal{T}))/G(\mathcal{T})$ and
$(M(\mathcal{T}) - G(\mathcal{T}))/G(\mathcal{T})$ are given in Table 1.



Tree T number    (C(T) - G(T)) / G(T)    (M(T) - G(T)) / G(T)
1                10.1 %                  69.1 %
2                30.9 %                  78.0 %
3                22.4 %                  76.4 %
4                16.2 %                  56.0 %
5                 6.5 %                  85.2 %
6                19.1 %                  73.5 %
7                26.6 %                  85.1 %

Table 1: Comparison between entropy conditioned on parent state, children states, and marginal
entropy. $(C(\mathcal{T}) - G(\mathcal{T}))/G(\mathcal{T})$ represents the relative distance between conditional
entropy given the children states and conditional entropy given the parent state (taken as
reference). $(M(\mathcal{T}) - G(\mathcal{T}))/G(\mathcal{T})$ represents the relative distance between marginal
entropy and conditional entropy given the parent state.

It can be seen from Table 1 that $C(\mathcal{T})$ is much closer to $G(\mathcal{T})$ than $M(\mathcal{T})$ is. As a
consequence, profiles based on $H(S_u \mid S_{c(u)}, X = x)$ provide moderate amplification of the
perception of state uncertainty in our example. By contrast, $M(\mathcal{T})$ is a poor approximation of
the global state tree entropy. As a consequence, the smoothed probability profiles are irrelevant
to quantify uncertainty related to the state tree.
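
The ordering G(T) ≤ C(T) ≤ M(T) discussed in this section can be checked by brute force on a very small example. The sketch below is not the algorithm of this report: it enumerates every state tree of a hypothetical 4-vertex tree with 2 states, using a made-up posterior Markov tree (the `ROOT` and `TRANS` values are arbitrary and stand for the state distribution given X = x), and computes the three sums directly from the joint distribution.

```python
import itertools
import math

# Hypothetical toy tree: vertex 0 is the root with children 1 and 2; vertex 1 has child 3.
PARENT = {1: 0, 2: 0, 3: 1}
CHILDREN = {0: [1, 2], 1: [3], 2: [], 3: []}
N_VERTICES, N_STATES = 4, 2

# Made-up posterior Markov tree: ROOT[a] = P(S_0 = a | X = x) and
# TRANS[u][a][b] = P(S_u = b | S_parent(u) = a, X = x).
ROOT = [0.6, 0.4]
TRANS = {
    1: [[0.9, 0.1], [0.3, 0.7]],
    2: [[0.5, 0.5], [0.2, 0.8]],
    3: [[0.8, 0.2], [0.1, 0.9]],
}

def joint_distribution():
    """Enumerate P(S = s | X = x) for every state tree s."""
    joint = {}
    for s in itertools.product(range(N_STATES), repeat=N_VERTICES):
        prob = ROOT[s[0]]
        for u, parent in PARENT.items():
            prob *= TRANS[u][s[parent]][s[u]]
        joint[s] = prob
    return joint

def entropy_of(joint, vertices):
    """Entropy (nats) of the marginal distribution of the given vertices."""
    marginal = {}
    for s, prob in joint.items():
        key = tuple(s[v] for v in vertices)
        marginal[key] = marginal.get(key, 0.0) + prob
    return -sum(p * math.log(p) for p in marginal.values() if p > 0)

def conditional_entropy(joint, u, given):
    """H(S_u | S_given) = H(S_u, S_given) - H(S_given)."""
    return entropy_of(joint, [u] + list(given)) - entropy_of(joint, list(given))

joint = joint_distribution()
vertices = range(N_VERTICES)
G = sum(conditional_entropy(joint, u, [PARENT[u]] if u in PARENT else []) for u in vertices)
C = sum(conditional_entropy(joint, u, CHILDREN[u]) for u in vertices)
M = sum(conditional_entropy(joint, u, []) for u in vertices)
H_joint = entropy_of(joint, list(vertices))

print(f"G = {G:.4f}, C = {C:.4f}, M = {M:.4f}, H(S | X = x) = {H_joint:.4f}")
print(f"(C - G)/G = {(C - G) / G:.1%}, (M - G)/G = {(M - G) / G:.1%}")
```

On such a toy example G coincides with the joint entropy of the state tree, which is the additive decomposition along the tree, while C and M overestimate it. Brute-force enumeration is only feasible for tiny trees; it is used here purely as a sanity check.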

5 Conclusion and discussion 

5.1 Concluding remarks 

This work illustrates the relevance of using entropy profiles to assess state uncertainty in
graphical hidden Markov models. It has been shown that global state entropy can be decomposed
additively along the graph structure. In the particular case of HMC and HMT models, we
provided algorithms to compute the local contribution of each vertex to this entropy.

Used jointly with the Viterbi algorithm and its variants, these profiles allow a deeper
understanding of how the model assigns states to vertices, compared to plain Viterbi state
restoration and smoothed probability profiles. In particular, these profiles may highlight zones of
connected vertices where marginal state uncertainty is not only related to the observed value at
each vertex, but where concurrent subtrees are plausible restorations in this zone. Such situations
are characterized by high mutual information between neighboring states.

Equivalent algorithms remain to be derived for trees with conditional dependency between
children states given parent state (in particular, for trees oriented from the leaf vertices toward
the root), and in the case of the DAG structures mentioned in Section 3.1.

5.2 Connexion with model selection 

Selection of the number of states In the perspective of model selection, entropy
computation can also appear as a valuable tool. If irrelevant states are added to a graphical hidden
Markov model, global state entropy is expected to increase. This principle can be extended to
adding irrelevant variables (that is, variables that are independent from the states or
conditionally independent from the states given other variables). If the model parameters were known,
adding such variables would not change the state conditional distribution. However, since the
parameters are estimated from a finite sample, estimation induces perturbations in this conditional
distribution in the context of irrelevant variables, and the global state entropy tends to increase.
This intuitive statement explains why several model selection criteria based on state entropy were
proposed. Among these is the Normalized Entropy Criterion introduced by Celeux & Soromenho
(1996) in independent mixture models. It is defined for a mixture with J components as

$$\mathrm{NEC}(J) = \frac{H(S \mid X = x)}{\log f_{\hat{\theta}_J}(x) - \log f_{\hat{\theta}_1}(x)}$$

if $J > 1$, and has to be minimized. Here, $\theta_J$ denotes the parameters of a $J$-component mixture
model, $f_{\theta_J}$ its probability density function and $\hat{\theta}_J$ the maximum likelihood estimator of
$\theta_J$. Note that $H(S \mid X = x)$ also depends on $\hat{\theta}_J$. The number of independent model parameters
in $\theta_J$ will be denoted by $d_J$. For $J = 1$, NEC is defined as a ratio between the entropy of a mixture
model with different variances and equal proportions and means, and the difference between the
log-likelihoods of this model and a model with one component.

The ICL-BIC is also a criterion relying on global state entropy, and must be maximized. It
was introduced by McLachlan & Peel (2000, chap. 6) and is defined by

$$\text{ICL-BIC}(J) = 2 \log f_{\hat{\theta}_J}(x) - 2 H(S \mid X = x) - d_J \log(n),$$

where $n$ is the number of vertices in $X$.
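
The two criteria are simple functions of quantities that a fitted model already provides. The following minimal sketch transcribes the definitions given above for J > 1; the function and argument names are hypothetical, the numerical values are made up for illustration only, and the special definition of NEC for J = 1 is not covered.

```python
import math

def nec(state_entropy, loglik_j, loglik_1):
    """NEC(J) for J > 1: state entropy divided by the log-likelihood gain
    of the J-component model over the one-component model (to be minimized)."""
    gain = loglik_j - loglik_1
    if gain <= 0:
        raise ValueError("the J-component model must improve on the 1-component log-likelihood")
    return state_entropy / gain

def icl_bic(state_entropy, loglik_j, n_params, n_vertices):
    """ICL-BIC(J) = 2 log-likelihood - 2 state entropy - d_J log(n) (to be maximized)."""
    return 2.0 * loglik_j - 2.0 * state_entropy - n_params * math.log(n_vertices)

# Made-up inputs, for illustration only (not taken from the report).
nec_value = nec(state_entropy=50.0, loglik_j=-1000.0, loglik_1=-1200.0)
icl_value = icl_bic(state_entropy=50.0, loglik_j=-1000.0, n_params=20, n_vertices=500)
print(nec_value, icl_value)
```

Both functions penalize a large state entropy: NEC through the numerator, ICL-BIC through the subtracted term, which is why they tend to favor models with well-separated states.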

Although both criteria were originally defined in the context of independent mixtures, their 
generalization to graphical hidden Markov models is rather straightforward. By favoring models 
with small state entropy and high log-likelihood, they aim at selecting models such that the 
uncertainty of the state values is low, whilst achieving good fit to the data. In practice, they 
tend to select models with well-separated components in the case of independent mixture models 
(McLachlan & Peel, 2000, chap. 6). 



Criterion    Number of states
             4          5          6          7
BIC          -10,545    -10,558    -10,541    -10,558
NEC             0.48       0.37       0.32       0.46
ICL-BIC      -10,764    -10,742    -10,704    -10,814


Table 2: Value of three model selection criteria: BIC, NEC and ICL-BIC, to select the number 
of states in the Aleppo pines dataset. 



A similar criterion based on minimization of a contrast combining the loglikelihood and state 
entropy in the context of independent mixture models was proposed by Baudry et al. (2008). 
Selection of the number of mixture components was achieved by a slope heuristic. 

Applied to the Aleppo pines dataset in Section 4, BIC would assess the 4-state and the 6-state
HMT models as nearly equally suited to the dataset, and in practice the modeller could prefer
the more parsimonious 4-state model (see Table 2). In contrast, NEC and ICL-BIC would select
the 6-state HMT model, since it achieves a better state separation than the 4-state model, for
equivalent fit (as assessed by BIC).

Let us note however that the BIC, NEC and ICL-BIC criteria are not suitable for variable
selection, since the log-likelihoods of models with different numbers of variables cannot be
compared.
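
The reading of Table 2 described above can be reproduced mechanically. The sketch below copies the values of Table 2 and picks the best number of states for each criterion, assuming (as the text implies) that BIC and ICL-BIC are reported on a scale where larger is better and that NEC is minimized; the variable names are illustrative.

```python
# Values copied from Table 2 (number of states -> criterion value).
bic = {4: -10545, 5: -10558, 6: -10541, 7: -10558}
nec = {4: 0.48, 5: 0.37, 6: 0.32, 7: 0.46}
icl_bic = {4: -10764, 5: -10742, 6: -10704, 7: -10814}

best_bic = max(bic, key=bic.get)          # BIC: larger is better (assumed convention)
best_nec = min(nec, key=nec.get)          # NEC: smaller is better
best_icl = max(icl_bic, key=icl_bic.get)  # ICL-BIC: larger is better

# Gap between the 6-state and the 4-state models for each likelihood-based criterion.
bic_gap_4_vs_6 = bic[6] - bic[4]
icl_gap_4_vs_6 = icl_bic[6] - icl_bic[4]

print(best_bic, best_nec, best_icl, bic_gap_4_vs_6, icl_gap_4_vs_6)
```

The BIC gap between the 4-state and 6-state models is small compared with the corresponding ICL-BIC gap, which is the sense in which the entropy-based criteria discriminate between the two models more sharply than BIC.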



Selection of variables To decide which variables are relevant for the identification of hidden
Markov chain or tree models with interpretable states, global state entropy can be regarded as
a diagnostic tool. Adding irrelevant variables in the model is expected to increase the
state entropy; consequently, if adding a variable results in a reduction of the state entropy, this
variable can be considered as relevant. Moreover, the state space does not depend on the number
of observed variables. This makes the values of $H(S \mid X = x)$ and $H(S \mid Y = y)$ comparable, even
if the observed processes $X$ and $Y$ differ by their numbers of variables.

To illustrate this principle, the following experiment was conducted: ten samples of size 836
(same size as the dataset in the application of Section 4.2) were simulated independently, using a
Bernoulli distribution with parameter p = 0.5. They were also simulated independently from
the five other variables described in the application, and were successively added to the Aleppo
pines dataset.

After the addition of the $i$th Bernoulli variable $Y_i = (Y_{i,u})_{1 \le u \le 836}$, a 6-state HMT model was
estimated on the $(i + 5)$-dimensional dataset, and the total state entropy $H_i$ was computed. This
procedure was repeated ten times (i.e., samples $(Y_{i,j,u})_{1 \le u \le 836}$ were simulated for additional
variables $i = 1, \ldots, 10$ and for replications $j = 1, \ldots, 10$). Thus, 10 × 10 values of $H_{i,j}$ were
computed. For a given value of $i$, the observed variable was an $(i + 5)$-dimensional vector. For
$1 \le j \le 10$, let $H_{0,j} = H_0$ be the state entropy yielded by the 6-state model in Section 4.2.4 using
the original dataset. Its value does not depend on $j$. Only three values in
$(H_{i,j})_{1 \le i \le 10,\, 1 \le j \le 10}$ were below $H_{0,j}$. To assess the increase in state entropy related to the
inclusion of irrelevant variables, the following regression model was considered:

$$H_{i,j} = \alpha i + \beta + \varepsilon_{i,j},$$

where the residuals $(\varepsilon_{i,j})_{i,j}$ were assumed independent and Gaussian with mean 0 and variance
$\sigma^2$. The test of the null hypothesis $\mathcal{H}_0 : \alpha = 0$ against the alternative
$\mathcal{H}_1 : \alpha \in \mathbb{R}$ had P-value $10^{-3}$. The maximum likelihood estimate of $\alpha$ was
$\hat{\alpha} = 3.4$. This result highlights that state entropy significantly increased with the number of
additional variables.
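
Under independent Gaussian residuals, the maximum likelihood estimate of the slope in the regression above coincides with the ordinary least squares estimate. The sketch below fits such a trend on synthetic values only: the intercept 160, the noise level 5 and the random seed are arbitrary choices, the slope 3.4 is borrowed from the text merely to generate data, and none of the numbers are the entropies actually computed in the experiment. The function name `fit_linear_trend` is hypothetical.

```python
import random

def fit_linear_trend(i_values, h_values):
    """Ordinary least squares fit of h = alpha * i + beta; returns (alpha_hat, beta_hat)."""
    n = len(i_values)
    mean_i = sum(i_values) / n
    mean_h = sum(h_values) / n
    sxx = sum((i - mean_i) ** 2 for i in i_values)
    sxy = sum((i - mean_i) * (h - mean_h) for i, h in zip(i_values, h_values))
    alpha_hat = sxy / sxx
    return alpha_hat, mean_h - alpha_hat * mean_i

# Synthetic stand-in for the 10 x 10 entropies H_{i,j} (NOT the report's data):
# i indexes the number of added variables, j the replications.
rng = random.Random(0)
i_values, h_values = [], []
for i in range(1, 11):
    for j in range(1, 11):
        i_values.append(i)
        h_values.append(160.0 + 3.4 * i + rng.gauss(0.0, 5.0))

alpha_hat, beta_hat = fit_linear_trend(i_values, h_values)
print(f"alpha_hat = {alpha_hat:.2f}, beta_hat = {beta_hat:.2f}")
```

A positive estimated slope corresponds to state entropy increasing with the number of irrelevant variables; assessing its significance would additionally require the residual variance and a test statistic, which are left out of this sketch.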




Figure 13: Global state entropy of the whole forest of state trees, for models with 4 to 6 states, 
including or not the length variable. 

It can be seen from Figure 13 that global state entropy (computed on the whole forest of
state trees) was lowest for the 6-state HMT model including the "length" variable. Combined
with Table 2, this figure confirms that this HMT model is the most relevant for the Aleppo pine
dataset, since the information criteria BIC, NEC and ICL-BIC selected 6-state HMT models,
and since removing the "length" variable from this model increased state entropy. This 6-state
HMT model also has the most relevant interpretation, as illustrated in Section 4.

This highlights the potential benefit of using entropy-based criteria in model selection for 
hidden Markov models. 

A Proof of propositions

A.1 Algorithms for computing entropy profiles conditioned on the future in the case of hidden
Markov chain models

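This algorithm computes $H(S_{t+1}^{T-1} \mid S_t = j, X_{t+1}^{T-1} = x_{t+1}^{T-1})$ for $j = 0, \ldots, J-1$ and
$t = 0, \ldots, T-2$. It is initialized at $t = T-2$, for $j = 0, \ldots, J-1$, as follows:

$$
H(S_{T-1} \mid S_{T-2} = j, X_{T-1} = x_{T-1})
= -\sum_k P(S_{T-1} = k \mid S_{T-2} = j, X_{T-1} = x_{T-1}) \log P(S_{T-1} = k \mid S_{T-2} = j, X_{T-1} = x_{T-1}).
$$

The backward recursion is achieved, for $t = T-3, \ldots, 0$ and for $j = 0, \ldots, J-1$, using:

$$
\begin{aligned}
& H(S_{t+1}^{T-1} \mid S_t = j, X_{t+1}^{T-1} = x_{t+1}^{T-1}) \\
&= -\sum_{s_{t+1}, \ldots, s_{T-1}} P(S_{t+1}^{T-1} = s_{t+1}^{T-1} \mid S_t = j, X_{t+1}^{T-1} = x_{t+1}^{T-1})
   \log P(S_{t+1}^{T-1} = s_{t+1}^{T-1} \mid S_t = j, X_{t+1}^{T-1} = x_{t+1}^{T-1}) \\
&= -\sum_{s_{t+2}, \ldots, s_{T-1}} \sum_k
   P(S_{t+2}^{T-1} = s_{t+2}^{T-1} \mid S_{t+1} = k, S_t = j, X_{t+1}^{T-1} = x_{t+1}^{T-1})
   P(S_{t+1} = k \mid S_t = j, X_{t+1}^{T-1} = x_{t+1}^{T-1}) \\
&\qquad \times \left\{ \log P(S_{t+2}^{T-1} = s_{t+2}^{T-1} \mid S_{t+1} = k, S_t = j, X_{t+1}^{T-1} = x_{t+1}^{T-1})
   + \log P(S_{t+1} = k \mid S_t = j, X_{t+1}^{T-1} = x_{t+1}^{T-1}) \right\} \\
&= -\sum_k P(S_{t+1} = k \mid S_t = j, X_{t+1}^{T-1} = x_{t+1}^{T-1})
   \sum_{s_{t+2}, \ldots, s_{T-1}} P(S_{t+2}^{T-1} = s_{t+2}^{T-1} \mid S_{t+1} = k, X_{t+2}^{T-1} = x_{t+2}^{T-1}) \\
&\qquad \times \left\{ \log P(S_{t+2}^{T-1} = s_{t+2}^{T-1} \mid S_{t+1} = k, X_{t+2}^{T-1} = x_{t+2}^{T-1})
   + \log P(S_{t+1} = k \mid S_t = j, X_{t+1}^{T-1} = x_{t+1}^{T-1}) \right\} \\
&= \sum_k P(S_{t+1} = k \mid S_t = j, X_{t+1}^{T-1} = x_{t+1}^{T-1})
   \left\{ H(S_{t+2}^{T-1} \mid S_{t+1} = k, X_{t+2}^{T-1} = x_{t+2}^{T-1})
   - \log P(S_{t+1} = k \mid S_t = j, X_{t+1}^{T-1} = x_{t+1}^{T-1}) \right\}, \qquad (30)
\end{aligned}
$$

with

$$
\begin{aligned}
P(S_{t+1} = k \mid S_t = j, X_{t+1}^{T-1} = x_{t+1}^{T-1})
&= \frac{P(X_{t+1}^{T-1} = x_{t+1}^{T-1}, S_{t+1} = k \mid S_t = j)}{P(X_{t+1}^{T-1} = x_{t+1}^{T-1} \mid S_t = j)} \\
&= \frac{L_{t+1}(k)\, p_{jk} / G_{t+1}(k)}{\sum_m L_{t+1}(m)\, p_{jm} / G_{t+1}(m)}.
\end{aligned}
$$

Using a similar argument as in (30), the termination step is given by

$$
\begin{aligned}
H(S_0^{T-1} \mid X = x)
&= -\sum_j P(S_0 = j \mid X = x) \Bigg\{ \sum_{s_1, \ldots, s_{T-1}}
   P(S_1^{T-1} = s_1^{T-1} \mid S_0 = j, X_1^{T-1} = x_1^{T-1}) \\
&\qquad \times \log P(S_1^{T-1} = s_1^{T-1} \mid S_0 = j, X_1^{T-1} = x_1^{T-1})
   + \log P(S_0 = j \mid X = x) \Bigg\} \\
&= \sum_j L_0(j) \left\{ H(S_1^{T-1} \mid S_0 = j, X_1^{T-1} = x_1^{T-1}) - \log L_0(j) \right\}.
\end{aligned}
$$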
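
The backward recursion and termination step above can be sketched in a few lines for a discrete hidden Markov chain and checked against a brute-force enumeration of all state sequences. The sketch below is an illustration under several assumptions, not the authors' implementation: it takes $L_t(k)$ to be the smoothed probabilities $P(S_t = k \mid X = x)$ and $G_{t+1}(k)$ to be the predicted probabilities $P(S_{t+1} = k \mid X_0^t = x_0^t)$, it assumes strictly positive parameters (no zero predicted probability), and all function names and toy parameter values are hypothetical.

```python
import itertools
import math

def hmc_state_entropy(pi, p, b, x):
    """H(S | X = x) by the backward recursion conditioned on the future.
    pi[j]: initial law, p[j][k]: transition, b[j][o]: emission, x: observed symbols."""
    J, T = len(pi), len(x)
    # Forward pass: F[t][j] = P(S_t=j | x_0..x_t), G[t][j] = P(S_t=j | x_0..x_{t-1}).
    F, G, pred = [], [], list(pi)
    for t in range(T):
        G.append(pred)
        w = [pred[j] * b[j][x[t]] for j in range(J)]
        norm = sum(w)
        F.append([v / norm for v in w])
        pred = [sum(F[t][j] * p[j][k] for j in range(J)) for k in range(J)]
    # Backward smoothing: L[t][j] = P(S_t = j | X = x).
    L = [None] * T
    L[T - 1] = F[T - 1]
    for t in range(T - 2, -1, -1):
        L[t] = [F[t][j] * sum(p[j][k] * L[t + 1][k] / G[t + 1][k] for k in range(J))
                for j in range(J)]
    # Entropy recursion: E[j] = H(S_{t+1}^{T-1} | S_t = j, future observations).
    E = [0.0] * J  # empty future at t = T - 1
    for t in range(T - 2, -1, -1):
        new_E = []
        for j in range(J):
            w = [L[t + 1][k] * p[j][k] / G[t + 1][k] for k in range(J)]
            norm = sum(w)
            q = [v / norm for v in w]  # P(S_{t+1} = k | S_t = j, future observations)
            new_E.append(sum(qk * (E[k] - math.log(qk)) for k, qk in enumerate(q) if qk > 0))
        E = new_E
    # Termination step.
    return sum(L[0][j] * (E[j] - math.log(L[0][j])) for j in range(J) if L[0][j] > 0)

def brute_force_entropy(pi, p, b, x):
    """Same entropy, obtained by enumerating every state sequence (tiny T only)."""
    J, T = len(pi), len(x)
    weights = []
    for s in itertools.product(range(J), repeat=T):
        w = pi[s[0]] * b[s[0]][x[0]]
        for t in range(1, T):
            w *= p[s[t - 1]][s[t]] * b[s[t]][x[t]]
        weights.append(w)
    total = sum(weights)
    return -sum(w / total * math.log(w / total) for w in weights if w > 0)

# Toy 2-state chain with binary observations (arbitrary values).
pi = [0.6, 0.4]
p = [[0.7, 0.3], [0.2, 0.8]]
b = [[0.9, 0.1], [0.3, 0.7]]
x = [0, 1, 1, 0]
h_recursive = hmc_state_entropy(pi, p, b, x)
h_brute = brute_force_entropy(pi, p, b, x)
print(h_recursive, h_brute)
```

The recursion costs on the order of $T J^2$ operations, whereas the enumeration grows as $J^T$; the latter is included only to confirm that both routes give the same entropy on a small example.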

A.2 Entropy profiles conditioned on the children states for hidden
Markov tree models

A proof of Proposition 2 is given, in the case of binary trees for the sake of simplicity.

Proof. Let $lc(u)$ and $rc(u)$ denote the two children of vertex $u$. Applying the chain rule on the
children of the root vertex, we can write

$$
\begin{aligned}
H(S \mid X = x) = {} & H(S_0 \mid S_{c(0)}, X = x) + H(S_{lc(0)} \mid S_{c(lc(0))}, S_{rc(0)}, X = x) \\
& + H(S_{rc(0)} \mid S_{c(lc(0))}, S_{c(rc(0))}, X = x) + H(\bar{S}_{c(lc(0))}, \bar{S}_{c(rc(0))} \mid X = x),
\end{aligned}
$$

where $\bar{S}_{c(u)}$ stands for the state subtrees rooted at the children of $u$. This decomposition is
not unique, and we can choose to extract the conditional entropy corresponding to $rc(0)$ before
the conditional entropy corresponding to $lc(0)$. Applying the property that removing conditioning
variables cannot decrease entropy (Cover & Thomas, 2006, chap. 2),

$$
\begin{aligned}
H(S_{lc(0)} \mid S_{c(lc(0))}, S_{rc(0)}, X = x) &\le H(S_{lc(0)} \mid S_{c(lc(0))}, X = x), \\
H(S_{rc(0)} \mid S_{c(lc(0))}, S_{c(rc(0))}, X = x) &\le H(S_{rc(0)} \mid S_{c(rc(0))}, X = x),
\end{aligned}
$$

we obtain

$$
\begin{aligned}
H(S \mid X = x) \le {} & H(S_0 \mid S_{c(0)}, X = x) + H(S_{lc(0)} \mid S_{c(lc(0))}, X = x) \\
& + H(S_{rc(0)} \mid S_{c(rc(0))}, X = x) + H(\bar{S}_{c(lc(0))}, \bar{S}_{c(rc(0))} \mid X = x).
\end{aligned}
$$

Applying the same decomposition recursively from the root to the leaves and upper bounding
on each internal vertex completes the proof by induction. ■